This is the first in a series of posts about random text generation algorithms. We’ll cover Markov chain text generators and context-free grammars, which I’ve wanted to experiment with for some time. If you’re not familiar with random text generators, then you may be wondering what I’m talking about and why the heck I want to build one.
I’m interested in this for a few reasons: it forces me to go deeper into Clojure (my language of choice for this project) and to learn Clojurescript (for the webapp portion), I’ll learn a lot, and it will be loads of fun. Also, random text generators keep popping up in the news because they’ve been used to generate fake research papers in computer science, math, and physics that some have published, often in conference proceedings and on the e-print archive. Some fake papers even make it into peer-reviewed journals, to shagrin of academia and the publishing industry. This has highlighted flaws in the peer-review process.
Also, there’s a random text generator based off of Deepak Chopra’s twitter stream. The results are pretty interesting.
What’s a random text generator?
A random text generator is a program that spits out random snippets of text, be they phrases, sentences, paragraphs, or larger things. They can choose individual letters, words, or phrases at random, depending on how they’re designed. Since we’ll focus on Markov chain text generators, this post starts at the beginning, with Markov chains, also called Markov models.
What’s a Markov chain?
We want to generate reasonable, sensible sentences, at random. To do this, we need to know, given an initial word or phrase, which words or phrases are likely to follow. We can model these probabilities by studying existing text.
Consider a sentence as a sequence of words, . Markov theory says that any given position in the sequence encodes all the information necessary to predict items that come later in the sequence. Or, in mathematical terms, we can write the probability of the sequence as:
The term on the right-hand side of the above equation tells us the probability that word follows the word. In Markov chain language, this is called a transition probability. This terminology comes from viewing a Markov chain as a sequence of transitions between states of a system. These probabilities are what we need to calculate.
Calculating Transition Probabilities
To predict which word is likely to follow a given word, we need to know the transition probabilities, which we’ll write as for short. There are several ways to calculate them. This is where language modeling comes into play. Each way to calculate the probabilities corresponds to certain assumptions. We’ll look at three such models.
The simplest assumption is that the next word is independent of previous word, or . This is like selecting the next word entirely at random.
The next simplest assumption is that the next word depends on the previous word, or . This is an example of a first-order Markov model.
Also called a second-order Markov model, here we assume the next word depends on the previous two words, or .
We can illustrate these or any Markov models as directed acyclic graphs. The bigram and trigram models are shown below.
What does a third-order Markov model assume? Looking at Markov models from the state transition perspective, what do these assumptions say about how much the current state depends on the past?
Markov chains have many applications. You’ve already heard of random text generation, but two more that you’ve definitely come across are sentence completion and word completion. Think predictive typing apps, search query suggestions, and PageRank itself (see The Physics of PageRank for more on this).