Building a Simple Neural Network

So… you want to learn about neural networks? Well, you’ve come to the right place. This post won’t focus on the theory behind how neural nets work; there are already numerous blog posts and books for that.1 It focuses on building a neural net in code. So, if you want to skip straight to the code, the repo is on GitHub.

What is a Neural Network?

There are many ways to answer this question, but the answer that resonates most deeply with me, and is perhaps most fundamental, is that a neural network is basically a function: it transforms its input into its output. One of my old college professors actually wrote a paper, Approximation by superpositions of a sigmoidal function, proving that neural networks can approximate any continuous function.2 This capability is what makes neural networks so powerful and exciting. And all you need to do is select the right weights (it’s not quite that simple).

The Simplest Neural Network

perceptron

We’ll start with a single perceptron, the simplest model of a neuron. Depending on the weighted sum of its inputs, the output is either 1 or -1. An output of 1 means the perceptron is on, -1 means the perceptron is off.3 For our simple example, there’s one input, x, which has weight w. To determine the output, called the activation, we first take the dot product of the input and weight vectors, \sum_{i=1}^{N}w_i \cdot x_i, then pass the result through the sign function. You can see the code for this below.
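A minimal Python sketch of that computation (the function name `activate` is my own; the repo’s actual code may differ):

```python
def activate(inputs, weights):
    """Dot product of the input and weight vectors, passed through
    the sign function to get the activation."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= 0 else -1  # on (+1) or off (-1)

print(activate([2.0], [0.5]))   # positive weighted sum, so 1
```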

You may be wondering why we use the sign function. This is simply because we want the output of the perceptron to be 1 or -1. For other problems we might want a wider range of output values, in which case we would replace the sign function with something else, like the sigmoid or arctangent. In general, though, the activation functions used in neural networks take real-valued input and return output that is limited to a specific range or to a set of specific values.
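A quick sketch of the contrast, with the sigmoid standing in for a wider-range activation:

```python
import math

def sign(z):
    """Output restricted to the set {-1, 1}."""
    return 1 if z >= 0 else -1

def sigmoid(z):
    """Output restricted to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sign(0.7), sigmoid(0.7))   # 1 vs. roughly 0.668
```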

A Simple Problem

We’ll look at a simple binary classification problem: classifying an input as belonging to one of two categories. We can map each category to one of the two possible output values of the perceptron. But consider the case where x is zero. Then it doesn’t matter what the weights are set to; the activation of the perceptron will always be the same. That’s not good. To combat such problems we add a fixed input to the perceptron, called the bias, which is generally set to 1. The addition of the bias slightly changes how we compute the output: we add a term to the sum we saw above, \sum_{i=1}^{N}w_i \cdot x_i + w_0 x_0, where x_0 is the bias and w_0 is the bias weight.

Let’s put these pieces together:
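Here is a Python sketch of the combined pieces; it uses a class in place of the post’s struct, and the names are my guesses rather than the repo’s:

```python
class Perceptron:
    """A single perceptron with a fixed bias input of 1."""

    def __init__(self, weights):
        self.weights = weights  # weights[0] is the bias weight w_0

    def feed_forward(self, inputs):
        # Weighted sum, including the bias term w_0 * x_0 with x_0 = 1,
        # passed through the sign function.
        total = self.weights[0] * 1.0
        total += sum(w * x for w, x in zip(self.weights[1:], inputs))
        return 1 if total >= 0 else -1

p = Perceptron([0.1, 0.5])
print(p.feed_forward([2.0]))   # 0.1 + 0.5 * 2.0 = 1.1 > 0, so 1
```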

But how do we pick the right weights? The answer is that we don’t. There’s an algorithm for that, backpropagation, though it seems no one really calls it backpropagation until there are multiple layers. Backpropagation is a fancy way of saying that we propagate the error in the output back to the inputs:

  1. See how far away the prediction of our network is from the expected output
  2. Take a step in weight parameter space in the direction that minimizes the error. If you remember your calculus lessons, this is a step in the negative gradient direction. How big a step to take depends on the size of the error and on how fast we want to move in that direction. We don’t want to take too big a step or too small a step: in the former case we can easily shoot past the optimal weights, and in the latter case we might take a long time to get there.
  3. Return to step 1 and repeat until the error is “small enough”.

We could write steps 1 and 2 in code as
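In Python, steps 1 and 2 might be sketched like this (the learning rate and its value are my assumptions; the repo’s backProp may differ):

```python
class Perceptron:
    """A perceptron whose weights can be trained in place."""

    def __init__(self, weights, learning_rate=0.01):
        self.weights = weights          # weights[0] is the bias weight
        self.learning_rate = learning_rate

    def feed_forward(self, inputs):
        total = self.weights[0]         # bias input x_0 is fixed at 1
        total += sum(w * x for w, x in zip(self.weights[1:], inputs))
        return 1 if total >= 0 else -1

    def back_prop(self, inputs, expected):
        # Step 1: how far the prediction is from the expected output.
        error = expected - self.feed_forward(inputs)
        # Step 2: move each weight in the negative gradient direction,
        # scaled by the learning rate and the corresponding input.
        self.weights[0] += self.learning_rate * error   # bias input is 1
        for i, x in enumerate(inputs):
            self.weights[i + 1] += self.learning_rate * error * x

p = Perceptron([0.0, 0.0])
p.back_prop([1.0], expected=-1)
print(p.weights)   # both weights nudged toward predicting -1
```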

Note that we could also use the squared error here. The full training happens when we pass a sequence of (input, output) pairs to the backProp function. With each call to backProp, the weights of the perceptron are adjusted to reduce future errors. To handle the training, I’ve made a PerceptronTrainer struct, along with a struct to hold the training data.

Training the Perceptron

We want our perceptron to tell us whether a given point is above or below a line in the xy plane. You can pick any line you want, but I’ll take a simple one, y(x)=3x+1. We can generate training data by

  1. picking N points (x, y) at random
  2. labeling each point according to whether its y value is above or below the line

Then create a PerceptronTrainer, pass the training data to it, and call the train function.
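Putting it together, a Python sketch of the trainer and the data generation (class and function names are my own; the repo’s versions may differ):

```python
import random

class Perceptron:
    def __init__(self, weights, learning_rate=0.01):
        self.weights = weights          # weights[0] is the bias weight
        self.learning_rate = learning_rate

    def feed_forward(self, inputs):
        total = self.weights[0] + sum(
            w * x for w, x in zip(self.weights[1:], inputs))
        return 1 if total >= 0 else -1

    def back_prop(self, inputs, expected):
        error = expected - self.feed_forward(inputs)
        self.weights[0] += self.learning_rate * error
        for i, x in enumerate(inputs):
            self.weights[i + 1] += self.learning_rate * error * x

class TrainingData:
    """One labeled example: a point and its expected output."""
    def __init__(self, inputs, expected):
        self.inputs = inputs
        self.expected = expected

class PerceptronTrainer:
    """Feeds each training example to the perceptron's back_prop."""
    def __init__(self, data):
        self.data = data

    def train(self, perceptron):
        for d in self.data:
            perceptron.back_prop(d.inputs, d.expected)

def make_data(n):
    """Random points labeled +1 above the line y = 3x + 1, -1 below."""
    data = []
    for _ in range(n):
        x = random.uniform(-10, 10)
        y = random.uniform(-40, 40)
        data.append(TrainingData([x, y], 1 if y > 3 * x + 1 else -1))
    return data

p = Perceptron([0.0, 0.0, 0.0])   # bias weight plus one weight per coordinate
trainer = PerceptronTrainer(make_data(500))
trainer.train(p)
```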

How Well Does the Perceptron Work?

Let’s pass 100 random inputs to the perceptron and see how often the predictions are correct. We’ll also create a new, untrained perceptron and see how often its predictions are correct.
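The check might look like this compact, standalone sketch (the perceptron is reduced to two functions here, and the untrained perceptron gets random weights; all names are my own):

```python
import random

def feed_forward(weights, inputs):
    total = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if total >= 0 else -1

def back_prop(weights, inputs, expected, lr=0.01):
    error = expected - feed_forward(weights, inputs)
    weights[0] += lr * error
    for i, x in enumerate(inputs):
        weights[i + 1] += lr * error * x

def label(x, y):
    return 1 if y > 3 * x + 1 else -1   # above or below y = 3x + 1

def random_point():
    return random.uniform(-10, 10), random.uniform(-40, 40)

# Train one perceptron; give the other random, untrained weights.
trained = [0.0, 0.0, 0.0]
untrained = [random.uniform(-1, 1) for _ in range(3)]
for _ in range(2000):
    x, y = random_point()
    back_prop(trained, [x, y], label(x, y))

def accuracy(weights, n=100):
    """How many of n random points the perceptron classifies correctly."""
    hits = 0
    for _ in range(n):
        x, y = random_point()
        if feed_forward(weights, [x, y]) == label(x, y):
            hits += 1
    return hits

print(accuracy(trained), accuracy(untrained))
```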

I get anywhere from 88-100% correct for the trained perceptron and about 4-40% correct for the untrained perceptron. Not bad for a simple neural network and a simple problem.


  1. A very nice online book is Michael Nielsen’s Neural Networks and Deep Learning
  2. There were several papers written around the same time that address these issues. They’re behind a paywall, but you can probably get them on Sci-Hub: Multilayer feedforward networks are universal approximators; Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks; On the approximate realization of continuous mappings by neural networks; Approximation capabilities of multilayer feedforward networks.
  3. Usually, a perceptron’s output is 1 or 0. I have a specific use case in mind for which -1 is more convenient than 0. 

Markov Chain Text Generation – Part 1

This is the first in a series of posts about random text generation algorithms. We’ll cover Markov chain text generators and context-free grammars, which I’ve wanted to experiment with for some time. If you’re not familiar with random text generators, then you may be wondering what I’m talking about and why the heck I want to build one.

Why bother?

I’m interested in this for a few reasons: it forces me to go deeper into Clojure (my language of choice for this project) and to learn ClojureScript (for the webapp portion), I’ll learn a lot, and it will be loads of fun. Also, random text generators keep popping up in the news because they’ve been used to generate fake research papers in computer science, math, and physics, some of which have been published, often in conference proceedings and on the e-print archive. Some fake papers even make it into peer-reviewed journals, to the chagrin of academia and the publishing industry. This has highlighted flaws in the peer-review process.

Also, there’s a random text generator based on Deepak Chopra’s Twitter stream. The results are pretty interesting.

What’s a random text generator?

A random text generator is a program that spits out random snippets of text, be they phrases, sentences, paragraphs, or larger things. They can choose individual letters, words, or phrases at random, depending on how they’re designed. Since we’ll focus on Markov chain text generators, this post starts at the beginning, with Markov chains, also called Markov models.

What’s a Markov chain?

We want to generate reasonable, sensible sentences, at random. To do this, we need to know, given an initial word or phrase, which words or phrases are likely to follow. We can model these probabilities by studying existing text.

Consider a sentence as a sequence of N words, w_1, \ldots, w_N. The Markov assumption says that any given position in the sequence encodes all the information necessary to predict the items that come later. Or, in mathematical terms, we can write the probability of the sequence as:

P(w_1, \ldots, w_N)=P(w_1)\prod_{n=2}^N P(w_n|w_{n-1})

The term on the right-hand side of the above equation, P(w_n|w_{n-1}), tells us the probability that the n^{th} word follows the (n-1)^{st} word. In Markov chain language, this is called a transition probability. This terminology comes from viewing a Markov chain as a sequence of transitions between states of a system. These probabilities are what we need to calculate.

Calculating Transition Probabilities

To predict which word is likely to follow a given word, we need to know the transition probabilities, which we’ll write as P_{n,n-1} for short. There are several ways to calculate them. This is where language modeling comes into play. Each way to calculate the probabilities corresponds to certain assumptions. We’ll look at three such models.

Unigram model

The simplest assumption is that the next word is independent of the previous word, or P_{n,n-1}=P(w_n). This amounts to selecting the next word at random, weighted only by how often each word occurs.

Bigram model

The next simplest assumption is that the next word depends on the previous word, or P_{n,n-1}=P(w_n|w_{n-1}). This is an example of a first-order Markov model.
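As a concrete sketch, bigram transition probabilities can be estimated by counting adjacent word pairs in a corpus (a Python illustration of my own; the project itself is in Clojure):

```python
from collections import Counter, defaultdict

def bigram_probabilities(words):
    """Estimate P(w_n | w_{n-1}) from counts of adjacent word pairs."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return {prev: {w: c / sum(followers.values())
                   for w, c in followers.items()}
            for prev, followers in counts.items()}

probs = bigram_probabilities("the cat sat on the mat".split())
print(probs["the"])   # {'cat': 0.5, 'mat': 0.5}
```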

Trigram model

Also called a second-order Markov model, here we assume the next word depends on the previous two words, or P_{n,n-1}=P(w_n|w_{n-1}, w_{n-2}).

We can illustrate these or any Markov models as directed acyclic graphs. The bigram and trigram models are shown below.

DAG representation of a first and second-order Markov model.

What does a third-order Markov model assume? Looking at Markov models from the state transition perspective, what do these assumptions say about how much the current state depends on the past?

Markov chains have many applications. You’ve already heard of random text generation, but two more that you’ve definitely come across are sentence completion and word completion. Think predictive typing apps, search query suggestions, and PageRank itself (see The Physics of PageRank for more on this).

The Physics of PageRank

Some of you may know that the physicist Kenneth Wilson died on June 15^{th}. His work made a huge impact on physics: not just particle physics, but condensed matter physics as well. No part of physics formulated as a quantum field theory was left untouched. So, what did he do? And what does it have to do with PageRank?

Wilson’s Magnum Opus

For theorists to actually calculate anything using quantum field theory (QFT), renormalization was needed. Feynman, Tomonaga, and Schwinger each independently introduced renormalization into QFT. Freeman Dyson came along afterwards and showed that the three formalisms were actually the same thing. But the theory wasn’t on firm conceptual ground until Wilson’s work. Kenneth Wilson introduced the Renormalization Group (RG), and with the RG we could calculate things with QFT in a systematic, conceptually coherent manner.

(For more on the history, you can read a great blog post about it at Quantum Frontiers.)

When I was in graduate school, the RG was easily one of the most interesting aspects of theoretical physics. I loved that it provided us with a clear theory of critical phenomena and phase transitions as well.

What does this have to do with PageRank though? Here’s a rough outline of the relationship…

The Renormalization Group

Imagine a system described by a vector of parameters, K, and consider a transformation that shifts the parameters to some K'. We can describe this situation with the transformation equation

K'=\mathcal{R}(K),

where \mathcal{R} is the RG operator. If the transformation operator is applied repeatedly, then we end up with a sequence of parameter vectors K, K', K'',\ldots:

K^{(n)}=\mathcal{R}(K^{(n-1)})=\cdots=\mathcal{R}^n(K^{(0)}),

where n=0,1,2,\dots and K^{(0)} is our initial K.

If we plotted the initial K^{(0)} and plotted results of the successive transformations, we would have a sequence of points connecting K^{(0)} and K^{(n)}. The sequence of points is the RG flow and is represented by the above equation.

It’s possible that we may have fixed points in our parameter space. A fixed point K^{\ast} is one that remains unchanged when the transformation is applied:
\mathcal{R}(K^{\ast})=K^{\ast}.

Note that I’m glossing over a lot of detail.

The PageRank Algorithm

At its core, PageRank is an algorithm which answers the following question:

If we select a site in a network or graph at random, what is the likelihood of reaching any other given site?

PageRank determines the answer in an iterative fashion, using equations that look a lot like the RG equations. The iterative algorithm proceeds until the probabilities on the sites stop changing. This is our RG fixed-point.
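A toy sketch of that iteration (my own simplified version, using the usual damping factor of 0.85; real PageRank handles dangling pages and vastly larger graphs):

```python
def pagerank(links, damping=0.85, tol=1e-10):
    """Iterate the rank update until the probabilities stop changing,
    i.e., until we reach the fixed point."""
    n = len(links)
    ranks = [1.0 / n] * n
    while True:
        new_ranks = []
        for i in range(n):
            # Every page j that links to page i passes along an equal
            # share of its own rank.
            incoming = sum(ranks[j] / len(links[j])
                           for j in range(n) if i in links[j])
            new_ranks.append((1 - damping) / n + damping * incoming)
        if max(abs(a - b) for a, b in zip(ranks, new_ranks)) < tol:
            return new_ranks
        ranks = new_ranks

# A tiny three-page web: page 0 links to 1 and 2; pages 1 and 2 link to 0.
ranks = pagerank([[1, 2], [0], [0]])
print(ranks)   # page 0, with two incoming links, ends up ranked highest
```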

You can learn more about PageRank and the iterative ranking algorithms which came before in a 2010 arXiv article PageRank: Standing on the shoulders of giants.

I’ll really have to look into this in more depth sometime, especially where the fixed-points are concerned.