Introduction
In my last post I discussed the general context in which neural networks sit, namely machine learning algorithms. The problem being solved is that of abstracting the definition of the rules for solving a problem away from the programmer.
We also saw that neural networks might stand out in their ability to identify useful feature transformations automatically, which would serve this overarching purpose.
Here I’ll describe the fundamentals of neural network architecture, and why this architecture might be useful.
Universality
Let's start by examining what a neural network can theoretically do. Remember, the problem at hand is to take “some” input, and say something useful about it. We might want to group some inputs together, for example. In the extreme, most fine-grained case, we might want to distinguish between every possible input. All other problems are a subset of this extreme case. So let’s see if a neural network can do this.
We can translate information into binary format. For example, we could say that “red” is 00, “green” is 01, and “blue” is 10; any input can then be presented to the network as a string of bits, one bit per input node.
The approach is as follows. If we want to distinguish every possible input, we need an output node for each possible input, to say “yes” if we got that input. Ask yourself, do we have enough output nodes above, if each input node is a bit?
The answer is no: there are $2^n$ possible inputs for $n$ input bits, so in general we need exponentially many output nodes. If we do allow ourselves that many output nodes, though, we can choose the weights so that each output node fires for exactly one input pattern, effectively building a lookup table over the inputs.
Put briefly, neural networks are universal computation machines on the input, at least in theory (we’ve said nothing of efficiency).
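To make this concrete, here is a minimal sketch (my own illustration, using NumPy and a step activation; the helper names are invented) of such a lookup-table network over 2-bit inputs, with one output node per possible input pattern:

```python
import numpy as np

def step(z):
    """Threshold activation: 1 where z > 0, else 0."""
    return (z > 0).astype(int)

n_bits = 2
# All 2^n possible input patterns, each a vector of bits.
patterns = [np.array([(i >> k) & 1 for k in range(n_bits)]) for i in range(2 ** n_bits)]

# One output node per possible input pattern (2^n of them).
# Node i uses weight +1 on bits that should be 1, -1 on bits that should be 0,
# and a bias chosen so it fires only on an exact match.
W = np.array([[1 if bit == 1 else -1 for bit in p] for p in patterns])
b = 0.5 - np.array([p.sum() for p in patterns])

for x in patterns:
    print(x, "->", step(W @ x + b))  # exactly one output node is 1 for each input
```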
Non-linearity & Learning
It makes sense to parametrise the connecting functions, rather than specify each one explicitly. The idea is that we can “learn” the parameters that best satisfy a given objective function. For example, each node could compute a weighted sum of its inputs plus a bias, $w \cdot x + b$, passed through a transition function, with the weights $w$ and the bias $b$ being the parameters to learn.
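As a sketch of what such a parametrised node might look like (the function names and the particular numbers are my own, purely for illustration):

```python
import numpy as np

def node_output(x, w, b, f):
    """A single parametrised node: transition function f applied to w . x + b."""
    return f(np.dot(w, x) + b)

x = np.array([0.2, 0.7, 0.1])      # inputs to the node
w = np.array([0.5, -1.0, 2.0])     # weights: the parameters we would "learn"
b = 0.1                            # bias: also learned
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))  # one possible transition function (discussed below)

print(node_output(x, w, b, sigmoid))
```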
We’ll look at specific algorithms in the next post, and what issues arise with them. For now, let’s develop some intuition about the transition function. Consider the simple network below:
Consider using a discontinuous function, like a step function:

$$
f(z) =
\begin{cases}
1 & \text{if } z > 0, \\
0 & \text{otherwise.}
\end{cases}
$$
The problem here is that changing the input (in this case a real number) near the threshold makes the output jump abruptly from 0 to 1: a small tweak to a weight can completely flip a node, and with it the behavior of the whole network, which makes gradual, incremental learning difficult. A smooth alternative is the sigmoid function.
Other functions with a similar shape are also used, but the sigmoid gradient simplifies calculations. If $\sigma(z) = \frac{1}{1 + e^{-z}}$, then $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$, so the gradient can be computed directly from the value of the function itself.
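A small numerical sketch (my own, using NumPy) contrasting the step function with the sigmoid and checking the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ against a finite difference:

```python
import numpy as np

def step(z):
    return np.where(z > 0, 1.0, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # The gradient is expressible in terms of the sigmoid's own value.
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-6, 6, 7)
print(step(z))      # jumps abruptly from 0 to 1 at z = 0
print(sigmoid(z))   # varies smoothly over the same range

# Compare the analytic gradient with a central finite difference.
h = 1e-6
numerical = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
print(np.allclose(numerical, sigmoid_prime(z)))  # True
```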
While not necessary, the way I see this shape as being potentially beneficial is that the function saturates far from the center while allowing more sensitive tweaking near the center, where the gradient is largest. This is probably deep and deserves more analysis; I’ll try to touch on it in future posts.
We can think of the sigmoid as a smoothed-out version of the step function: we can change the weights by arbitrarily small amounts to change the output by arbitrarily small amounts. Why do we want a “step” at all? In reality, what we want is non-linearity. If we only used linear functions, every composition of layers would itself be linear, and higher-level features would be limited to linear combinations of the input. Perhaps signals like images and sounds, and many other kinds, are simply more amenable to non-linear analysis.
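The claim that linear functions compose into linear functions can be checked directly; here is a quick sketch (random matrices, my own example) showing that two stacked linear layers collapse into one:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)

# Two "layers" that are purely linear (no transition function, no bias).
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x  # a single equivalent linear map

print(np.allclose(two_layers, one_layer))  # True: the extra layer adds nothing
```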
We also want the transition function to be monotone, so that changing a weight in a given direction pushes the node’s output consistently in one direction; this makes the effect of a parameter adjustment predictable.
In the meantime, since we’re going to be playing with neural networks, let’s define more compact and consistent notation.
Notation
At each layer we can call the “input” nodes a vector $x$, collect the weights into a matrix $W$ (one row per node in the layer) and the biases into a vector $b$, and write the layer’s output as $a = \sigma(Wx + b)$, with $\sigma$ applied element-wise.
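In code, one layer’s computation is then a one-liner; a minimal sketch (NumPy, with the names $x$, $W$, $b$, $a$ as above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(x, W, b):
    """One layer: a = sigma(W x + b), with sigma applied element-wise."""
    return sigmoid(W @ x + b)

x = np.array([1.0, 0.0, 1.0])            # input vector
W = np.array([[0.2, -0.5, 0.3],
              [1.0,  0.4, -0.7]])         # one row of weights per output node
b = np.array([0.1, -0.2])                 # one bias per output node
print(layer(x, W, b))                     # the layer's activation vector a
```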
Conclusion
We’ve briefly looked at how neural networks can, in theory, distinguish any set of inputs. We’ve highlighted the overall architecture, including the desire for differentiable, non-linear, monotone transition functions.
In the next post, we’ll dive into how a neural net actually “learns” its parameters, multi-layer networks, and gradients.
Acknowledgements
This series of posts is heavily inspired by the writings of Michael Nielsen, Richard Socher, and Christopher Olah, to whom much credit is due. Some of the statistical intuition is also influenced by Trevor Hastie and Rob Tibshirani.