Introduction
In my last post I spoke about the general context in which neural networks sit, namely machine learning algorithms. The problem being solved is that of abstracting the rules for solving a problem away from the programmer.
We also saw that neural networks might stand out in their ability to identify useful feature transformations automatically, which would serve this overarching purpose.
Here I’ll describe the fundamentals of neural network architecture, and why this architecture might be useful.
Universality
Let’s start by examining what a neural network can theoretically do. Remember, the problem at hand is to take “some” input, and say something useful about it. We might want to group some inputs together, for example. In the extreme, most fine-grained case, we might want to distinguish between every possible input. All other problems are a subset of this extreme case. So let’s see if a neural network can do this.
We can translate information into binary format. For example we could say that “red” is $0010$ and “blue” is $0001$. More compactly we could just use $0$ and $1$. Information Theory is a fascinating field. Let’s call the digits of a given input $x_0, x_1, x_2, \ldots$, reading the bits from right to left. In the first “red” example, we have $x_0=0, x_1=1, \ldots$. A neural network is made up of nodes (I avoid terminology like “neurons” as much as I can, since I consider the problem of connecting this architecture to the human brain a whole other area of research). These nodes are organised into an input layer, an output layer, and one or more hidden layers. Let’s consider one hidden layer first:
The approach is as follows. If we want to distinguish every possible input, we need an output node for each possible input, to say “yes” if we got that input. Ask yourself, do we have enough output nodes above, if each input node is a bit?
The answer is no: there are $2 \times 2 \times 2 = 8$ possible combinations of three bits, so in reality we’d need $8$ output nodes in the above example; let’s pretend that we have them. Next, what do the connections represent? These can be any function we like. We could manually set all connections to return $0$. More interestingly, if $f_{i,j}$ is the function from input node $x_i$ to hidden layer node $h_j$, we could set $f_{i,j}(x_i) = \mathbb{1}(i=j) \cdot x_i$. In other words, $h_i$ just checks to see whether $x_i$ is $1$. The output layer then combines these to pick out each possible input. For example, $o_0$ could be the indicator that $h_0$ and $h_1$ are $1$ and every other $h_i$ is $0$, meaning our input looks like $0000\ldots000011$. This is a trivial example, but one could imagine doing more meaningful things.
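To make the construction concrete, here is a minimal Python sketch (purely illustrative; the function names and the bit ordering are my own choices, not anything standard): each hidden node copies its input bit, and each of the $8$ output nodes fires for exactly one $3$-bit input.

```python
import itertools

def hidden_layer(x):
    # f_{i,j}(x_i) = 1(i == j) * x_i: hidden node h_i just copies the bit x_i
    return list(x)

def output_layer(h, n_bits=3):
    # One output node per possible input: o_k fires iff the hidden pattern
    # matches the binary representation of k (bit 0 taken as the rightmost bit)
    outputs = []
    for k in range(2 ** n_bits):
        pattern = [(k >> i) & 1 for i in range(n_bits)]
        outputs.append(1 if h == pattern else 0)
    return outputs

# Every 3-bit input lights up exactly one of the 8 output nodes
for x in itertools.product([0, 1], repeat=3):
    print(x, output_layer(hidden_layer(x)))
```

Each input activates exactly one output node, which is precisely the extreme, most fine-grained case described above.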
Put briefly, a neural network can in principle distinguish every possible input, and therefore compute any function of it, at least in theory (we’ve said nothing about efficiency).
Non-linearity & Learning
It makes sense to parametrise the connecting functions, rather than specify each one explicitly. The idea is that we can “learn” the parameters that best accomplish a given objective. For example we could have $h_j = \sum_i w_{i,j} \cdot x_i$, a linear combination of the inputs, with weights specific to that hidden node. How then do we learn better parameters?
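As a quick illustration (a sketch only, with the sizes and weights chosen arbitrarily), here is one such parametrised layer written out in Python with numpy:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

m, n = 3, 4                    # 3 input nodes, 4 hidden nodes (arbitrary sizes)
W = rng.normal(size=(n, m))    # W[j, i] = w_{i,j}, the weight from x_i to h_j

x = np.array([0.0, 1.0, 0.0])  # some input
h = W @ x                      # h_j = sum_i w_{i,j} * x_i, one value per hidden node
print(h)
```

Learning then amounts to adjusting the entries of $W$ so that the network’s outputs better match whatever objective we set.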
We’ll look at specific algorithms in the next post, and what issues arise with them. For now, let’s develop some intuition about the transition function. Consider the simple network below:
Consider using a discontinuous function, like a step function:
The problem here is that changing the function’s input (in this case a real number) near $0$ can lead to a big jump in the output, so a small tweak to the weights can flip the result entirely, and being slightly wrong can really cost us. So we probably want a smooth, differentiable function. But why might we want something other than a nice and simple linear function? It would, after all, be differentiable. A commonly used function is the sigmoid, which looks like this:
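To see the difference numerically (a quick sketch; the exact numbers are only for illustration), compare a step function and the sigmoid on two inputs that straddle $0$:

```python
import math

def step(z):
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for z in (-0.001, 0.001):
    print(z, step(z), round(sigmoid(z), 5))
# the step jumps from 0.0 to 1.0, while the sigmoid barely moves (~0.49975 -> ~0.50025)
```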
Other functions with similar shape are also used, but the sigmoid gradient simplifies calculations. If $g(x)$ is the sigmoid, the gradient is given by $g(x) \cdot (1-g(x))$:
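For reference, writing the sigmoid out and differentiating gives exactly that form:

$$g(x) = \frac{1}{1 + e^{-x}}, \qquad g'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = g(x)\,\big(1 - g(x)\big).$$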
While not necessary, the way I see this gradient as being potentially beneficial is that it gives a region of rapid change far from the center and more sensitive tweaking near the center. This is probably deep and deserves more analysis; I’ll try to touch on it in future posts.
We can think of the sigmoid as a smoothed out version of the step function. We can change weights by arbitrarily small amounts to change the output by arbitrarily small amounts. Why do we want a “step” at all? In reality, what we want is non-linearity. If we only used linear functions, all compositions would be linear and higher level features would be limited. Perhaps signals like images and sounds, and many other kinds, are just amenable to non-linear analysis.
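To see why purely linear functions would limit us, note that stacking two linear layers (ignoring biases for a moment) just gives another linear map:

$$W_2\,(W_1 \cdot x) = (W_2 W_1)\cdot x,$$

so however many linear layers we compose, the whole network collapses to a single linear transformation of the input.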
We also want it to be monotone, so that changing $w$ up or down moves the output in a predictable direction: we don’t introduce local optima that could trap a gradient search.
In the meantime, since we’re going to be playing with neural networks, let’s define some more compact and consistent notation.
Notation
At each layer we can call the “input” nodes a vector $x \in \mathbb{R}^{m+1}$. There is one extra dimension because we set $x_0=1$ by convention, so that the corresponding weight $w_0$ acts as a bias or intercept term. The weight matrix for this layer is then $W \in \mathbb{R}^{n \times (m+1)}$, where $n$ is the number of output nodes. The output nodes are given by $f(W \cdot x) \in \mathbb{R}^{n}$, where $f$ could be the sigmoid function (applied pointwise to the vector); for the next layer we would again prepend a $1$ to this output vector.
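Putting this notation together, a single layer’s forward pass might look like the following sketch (sizes and weights are arbitrary; only the conventions above are assumed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # applied pointwise to the vector

def layer(W, x):
    # x[0] is fixed to 1, so the first column of W plays the role of the bias
    return sigmoid(W @ x)

rng = np.random.default_rng(seed=0)
m, n = 3, 2                           # 3 "real" inputs, 2 output nodes
W = rng.normal(size=(n, m + 1))       # W in R^{n x (m+1)}

x = np.array([1.0, 0.2, -0.5, 0.7])   # x_0 = 1 by convention, then the actual inputs
print(layer(W, x))                    # f(W . x) in R^n
```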
Conclusion
We’ve briefly looked at how neural networks can, in theory, distinguish any set of inputs. We’ve highlighted the overall architecture, including the desire for differentiable, non-linear, monotone transition functions.
In the next post, we’ll dive into how a neural net actually “learns” its parameters, multi-layer networks, and gradients.
Acknowledgements
This series of posts is heavily inspired by the writings of Michael Nielsen, Richard Socher, and Christopher Olah, to whom much credit is due. Some of the statistical intuition is also influenced by Trevor Hastie and Rob Tibshirani.