Markov Chain Text Generation – Part 1

This is the first in a series of posts about random text generation algorithms. We’ll cover Markov chain text generators and context-free grammars, which I’ve wanted to experiment with for some time. If you’re not familiar with random text generators, then you may be wondering what I’m talking about and why the heck I want to build one.

Why bother?

I’m interested in this for a few reasons: it forces me to go deeper into Clojure (my language of choice for this project) and to learn Clojurescript (for the webapp portion), I’ll learn a lot, and it will be loads of fun. Also, random text generators keep popping up in the news because they’ve been used to generate fake research papers in computer science, math, and physics that some have published, often in conference proceedings and on the e-print archive. Some fake papers even make it into peer-reviewed journals, to shagrin of academia and the publishing industry. This has highlighted flaws in the peer-review process.

Also, there’s a random text generator based off of Deepak Chopra’s twitter stream. The results are pretty interesting.

What’s a random text generator?

A random text generator is a program that spits out random snippets of text, be they phrases, sentences, paragraphs, or larger things. They can choose individual letters, words, or phrases at random, depending on how they’re designed. Since we’ll focus on Markov chain text generators, this post starts at the beginning, with Markov chains, also called Markov models.

What’s a Markov chain?

We want to generate reasonable, sensible sentences, at random. To do this, we need to know, given an initial word or phrase, which words or phrases are likely to follow. We can model these probabilities by studying existing text.

Consider a sentence as a sequence of N words, w_1, \ldots, w_N. Markov theory says that any given position in the sequence encodes all the information necessary to predict items that come later in the sequence. Or, in mathematical terms, we can write the probability of the sequence as:

P(w_1, \ldots, w_N)=P(w_1)\prod_{n=2}^N P(w_n|w_{n-1})

The term on the right-hand side of the above equation tells us the probability that n-1^{st} word follows the n^{th} word. In Markov chain language, this is called a transition probability. This terminology comes from viewing a Markov chain as a sequence of transitions between states of a system. These probabilities are what we need to calculate.

Calculating Transition Probabilities

To predict which word is likely to follow a given word, we need to know the transition probabilities, which we’ll write as P_{n,n-1} for short. There are several ways to calculate them. This is where language modeling comes into play. Each way to calculate the probabilities corresponds to certain assumptions. We’ll look at three such models.

Unigram model

The simplest assumption is that the next word is independent of previous word, or P_{n,n-1}=P(w_n). This is like selecting the next word entirely at random.

Bigram model

The next simplest assumption is that the next word depends on the previous word, or P_{n,n-1}=P(w_n|w_{n-1}). This is an example of a first-order Markov model.

Trigram model

Also called a second-order Markov model, here we assume the next word depends on the previous two words, or P_{n,n-1}=P(w_n|w_{n-1}, w_{n-2}).

We can illustrate these or any Markov models as directed acyclic graphs. The bigram and trigram models are shown below.

DAG representation of a first and second-order Markov model.
DAG representation of a first and second-order Markov model.

What does a third-order Markov model assume? Looking at Markov models from the state transition perspective, what do these assumptions say about how much the current state depends on the past?

Markov chains have many applications. You’ve already heard of random text generation, but two more that you’ve definitely come across are sentence completion and word completion. Think predictive typing apps, search query suggestions, and PageRank itself (see The Physics of PageRank for more on this).

Trying to get things done

Earlier today, I came across an article on the Time Magazine website that is worth remembering: “Warren Buffett’s Strategy to Maximize Your Focus and Master Your Priorities”. The idea presented in the article is simple, yet powerful. Make a list of your top 25 goals and circle the five that are most important. The twenty that you didn’t select are to be avoided at all costs, until the top five are done. Simple, right?

Aligning actions with priorities

So, this got me thinking. A couple of years ago I read (about half) of the GTD book and dove into vertical planning, laying out my goals according to the airplane analogy David Allen describes in the book.

  • Lifetime/Long-term (50,000 ft)
  • 3-5 year goals (40,000 ft)
  • 1-2 year goals (30,000 ft)
  • Areas of Responsibility (20,000 ft)
  • Current projects (10,000 ft)
  • Tasks to move my projects forward (Ground level)

For a time, it helped me make sure my todos were aligned with my ultimate priorities. While I do think about the vertical view from time to time, I generally find myself chained to the current projects level. I want every task I put on my todo list to be clearly connected to my lifetime goals. To that end, I’ve decided to use a tag for each of my lifetime goals (there are only four). Any task that can’t be tied back to one of the four long-term goals is one that I can safely skip.

For the next month I’m going to try this out. I’m hoping for two things. I’ve already mentioned the first thing; actions clearly aligned with goals. The second thing is that I want to be able to look back at the end of the year and see how much I moved closer towards a 1-2 year, 3-5 year, or lifetime goal. We’ll see how it works out.

Image Credit: Angie Torres/Flickr CC BY-NC-ND 2.0

Data visualization

So, I love data visualization. Unfortunately, I spend most of my time at work coding and don’t have as much time to analyze or visualize data as I’d like to. Though I’ve been working on making time for it lately. I also come across particularly nice ones from time to time, like the one below.

“A Visual History of Nobel Prizes and Notable Laureates, 1901-2012”, from Maria Popova’s Brain Pickings site. The visualization was created by Giorgia Lupi, of Accurat. Without further ado…

In a rush

If you asked my wife, or my closest friends, what my greatest struggle is, they would say “time”. I’ve often felt that there isn’t enough time in the day, or in life, for me to do all the things I’d like to. I feel less like that since spending five months in India last year. But the feeling creeps up every now and then.

Which leads me to why I’m writing about this today. I visited the website of Dr. Elizabeth Lindsey, a fellow of the National Geographic Society, to see if there were any updates recently. The most recent post is It’s About Time…an unforgettable lesson. She talks about a Micronesian man who asks her why she’s always going so fast. She says her film crew is waiting for her. The man pauses and says “You folks have watches but you have no time.”

That’s me sometimes. Trying to do everything, now.

College: The best years of your life?

They say that college will be the best four years of your life. I always thought that was a rather depressing statement. Doesn’t that sort of thinking mean it’s all going to be downhill after college? Don’t get me wrong, college was great. I met some wonderful people, took some great classes, and learned a lot. But it wasn’t close to being the best four years of my life.

After college, I realized what kind of person I want to be. Since then, each moment has been a step in that direction, so long as I maintain focus on what’s important to me, with unwavering attention. Since then, my experience of life has been greater than I would have thought possible, that’s even taking painful events into account, like losing loved ones.

Reflections on 10-ish Days of Blogging

I’ve reached day 10 of the 10 Days to a Better Blog online workshop. (Cue applause track.) The focus is reflection.

Several things come to mind when I think about the workshop. I wrote about my goals for the year, why I want to write, and thought about what I want to write about. I wrote post about my tendency to want to do everything NOW and the need to pace myself, made an About Me page, and changed my blog template to one I’m much happier with. Oh, and it’s taken me closer to 20 days and it was fun too.

One of the most useful things for me was writing down the goals for the year. I don’t think I’ve ever done that before. So I can keep track of my progress, I’m going to dump them into a Google spreadsheet. I’ve never done a personal retrospective before, so it should be interesting come 2016.