Artificial Neural Networks
An amazing thing: Popular culture is fond of hyperbole: this
or that thing is ‘amazing’, ‘brilliant’ (in the UK), or ‘incredible’,
when in truth the thing is often neither remarkable nor
surprising. However, it is
not an exaggeration to say that the AlphaGo story is amazing
and even ‘awesome’. In March 2016 AlphaGo defeated South
Korean Go professional Lee
Sedol, one of the strongest human players of the game.
That part of the story is told in a 90-minute documentary, AlphaGo - The Movie. But the
story does not end there! The team that developed AlphaGo continued to explore ways
in which the methodology could be strengthened, and in doing so
produced a new version appropriately named AlphaGo Zero. The new version had zero reliance on
human knowledge of the game. Through self-teaching only,
AlphaGo Zero discovered novel strategies, concepts, and plays that had
never been seen before in the deep history of Go. In a test match it
defeated the original AlphaGo program 100 games to 0.1
Amazing things spring from all fields of
science, not just computer science. Every week we read of new
discoveries in biology or
chemistry or materials science, or some other scientific field.
Often the details of these ‘breakthroughs’ are only
accessible to specialists
in the field of discovery. Nevertheless, from a
layperson’s perspective, it seems possible at least to appreciate them,
in the same way one appreciates or enjoys art or music. Indeed the
effort to absorb this flow of scientific and technological discovery
could conceivably enhance one’s enjoyment of life. This page is about
my attempt to appreciate more
fully one of the ‘amazing’ developments that contributed to the
AlphaGo story.

Learning in birds and humans: About
a century ago psychologists and neurophysiologists began to
speculate about how humans and animals learn.2
At first the focus was on
external features of the process, the thing to be learned and the
response of the learner. As so-called ‘stimulus-response’
theories matured, they gave rise to further speculation as to
possible underlying mechanisms.
Did you know that the tufted titmouse
can remember thousands of places where it has hidden seeds? I
only recently learned this interesting fact (assuming it is a fact).
However, psychology proved
long ago that even pigeons are capable of learning, and worms
too, provided they haven’t been eaten by pigeons.
Clearly, human learning takes place in
the brain. Stubbing a toe does not appreciably affect learning,
whereas a moderate bump to the head puts at least a temporary stop to
it. The human brain is packed full of nerve cells,
or neurons. There is other stuff in there too, but mostly neurons and
their supporting cells.3 In order for learning to occur, some sort of
change must happen that involves neurons.
It was proposed early on that for
learning to take place the connections between neurons must evolve in
some way. Details came along later; they are still coming.
Conventional wisdom also holds that human learning takes
place in stages. Some new fact or concept may lodge into memory briefly
and then be lost, unless it is somehow consolidated. Subjectively it
seems that a newly acquired fact is learned instantly, but for most
people (certainly for myself) the phrase ‘use it or lose it’ applies. At
some point in the future, for example, we may recall that a certain bird
species can remember a great many locations for stored food, but then struggle to
dredge up the name of an example species. Artificial neural networks do
not suffer this particular concern. Instead they are
designed to reproduce those aspects of real neural networks that appear
most relevant to learning, while ignoring or deemphasizing features
that are less obviously useful.
The neuron diagram on the right is from Wikipedia (reproduced under the terms
of a Creative Commons license). Real
neurons can have many inputs, up to thousands, called dendrites, and
one output (axon). Neurotransmitter substances convey signals across
the junctions between neurons (synapses). Familiar neurotransmitters
are acetylcholine, dopamine, serotonin, etc.
Neuronal inputs may be differentially sensitive
to different neurochemical substances and in different amounts.
However, once the total amount of input stimulation exceeds a neuron’s
firing threshold, an
‘action potential’ is generated. This electro-chemical event propagates
along the axon from the cell body toward the neuron’s output.
An elementary fact about neurons is that
they either fire or they don’t. The ‘action potential’ is an
all-or-nothing phenomenon. While artificial neurons are not encumbered
by this constraint, some neural network architectures intentionally
adopt it. Although variability in the amplitude of the action
potential is not considered functionally significant, neurons do fire
at different rates. After a neuron fires it is refractory for a brief
interval, whereupon it can fire again. Thus, in theory, the rate at
which a neuron fires correlates with its potency: the more rapidly a
neuron fires, the greater its influence on the neurons upon which it
impinges, and indirectly upon their successors. Neurons
may be excitatory or inhibitory in their effect on the
cells that they connect to. If connections between neurons can be
strengthened (or weakened) then it is a small step to imagine that
neuronal-level changes can affect the totality of an interconnected
network in such a way as to promote learning.
Another ‘fact’ recalled from long-ago
school classes is that myelinated neurons conduct impulses much faster
than non-myelinated ones. Myelinated neurons are fast (on the order of 100
meters per second, give or take), yet still much
slower than the slowest computer, let alone the arrays of graphics
processors that power some machine learning applications. There is a
great deal more to the neuron story, as can be
inferred from the Wikipedia image, but these are the basic attributes
that carry over to the artificial
neural networks context.
Artificial neurons,4 the participants in
artificial neural networks, simulate selected features of real neurons.
In particular they may have from one to many inputs, but only one
output. Their inputs are differentially sensitive (via ‘weight’
parameters) and their outputs include an additional additive
constant term called the bias.
The artificial neuron’s activation
function is the analog of ‘all-or-nothing’ firing in real neurons. In
some artificial neural networks activation is a ‘step function’ that
outputs ‘1’ when the input exceeds a specified value, and ‘0’
when the threshold is not met. This
type of activation produces a binary output, just
as the action potential either occurs or doesn’t. However, artificial
neural networks may also implement other types of activation, thus
enabling a variety of experimental learning strategies.
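The binary step activation described above fits in a few lines of Python; the function name and default threshold here are my own choices, not from the original:

```python
def step(x, threshold=0.0):
    """Binary step activation: output 1 when the input exceeds
    the firing threshold, and 0 otherwise."""
    return 1.0 if x > threshold else 0.0
```

Like the biological action potential, the output is all-or-nothing: there is no in-between value.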
The preceding paragraphs describe
essential features of real and artificial neurons. Real neurons
interconnect to form networks that are capable of learning. The rest of
this page is about how networks of artificial neurons
can be programmed to learn.
Constructing a neural network: The
first step in neural network programming is to construct a network, or
more specifically to define the architecture of the network to be
trained. The working part of the network (between inputs and outputs)
consists of interconnected layers of neurons. See, for example, the
‘dense’ neural network depicted in the training cycle diagram below.
That network (or the segment reproduced) consists of 3 layers of 8
neurons per layer. Straight lines represent connections.
Biases are not shown, but each neuron (circle) should be assumed to have an
associated bias and an output activation function, also not shown.
Prior to training, neuron weights are
initialized to random Gaussian values. Biases can be similarly
initialized, or they can be set to 0. In most
computer languages the
most commonly implemented random
function produces a uniform (rectangular) distribution, although
mathematical programming languages generally also have the capability
to produce normally
distributed random numbers. If the implementation’s computer language
does not have native Gaussian random capability, the deficit
is easily remedied using a method called the Box-Muller transform.
Alternatively a Gaussian distribution can be computed externally, for
example using FreeMat, and values imported to
the implementation environment as needed.
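The Box-Muller transform mentioned above turns two uniform random samples into a normally distributed one. A Python sketch (the document's own implementation would have been in MUMPS; this version is mine):

```python
import math
import random

def gaussian_box_muller(mean=0.0, stddev=1.0):
    """Box-Muller transform: convert two uniform random numbers
    into one normally distributed value."""
    u1 = random.random()
    while u1 == 0.0:          # guard against log(0)
        u1 = random.random()
    u2 = random.random()
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return mean + stddev * z
```

Averaged over many samples, the output has the requested mean and standard deviation.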
The role of feedback in
human learning takes many forms: expression of approval or disapproval
from mother to infant, high or low marks on a school exam, the
proverbial burned finger on touching a hot stove, the hangover that
follows a night on the binge, and so on. Some kinds of feedback may be
awkward to quantify, but they nevertheless produce identifiable gains
in knowledge or skills. For machine learning, it is essentially the same story.
School students and those who remember
their school days are familiar with true/false tests. Suppose that a
Geography teacher gives her class a true/false test. “Quito is the
capital of Ecuador. __(T) __(F)” The student checks (T). That
is plus 1 point or minus 0 points, depending on how the teacher scores
the test. “Quebec is the capital of Canada. __(T) __(F)” The student
checks (T) again and this time the teacher counts minus 1 point in
scoring the test. For each question there is an Expected answer and an
actual answer, the student’s Output. The score for each question
reflects whether the Output conforms to the Expected answer or not.
Such a score is easily quantified and accumulated on a question-by-question basis.
True/False tests force the student to
make a binary decision, which in some cases is an educated guess.
“Qatar borders the Mediterranean Sea. __(T) __(F)” The
student ponders: Qatar is somewhere over there. “I don’t think that it
is on the Mediterranean, but then I could be wrong.” What if instead of
checking true or false the student could indicate a degree of
confidence, a percent-sure number, with 100% corresponding to
definitely true and 0% meaning certainly false? To hedge his bet the
student marks 20% as the answer. How wrong is his answer? 20%
confidence that the statement is True corresponds to 80% confidence
that it is False, and since the correct answer is False the student
should be awarded +0.8 on that question, or penalized 0.2.
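The percent-sure scoring just described is easy to express in code. A minimal Python sketch (the function is my own illustration, not part of the original exercise):

```python
def grade(confidence_true, answer_is_true):
    """Credit for a percent-sure answer: the confidence the student
    effectively placed on whichever side turned out to be correct."""
    return confidence_true if answer_is_true else 1.0 - confidence_true
```

For the Qatar question, `grade(0.20, False)` yields 0.8: the student is credited with the 80% confidence he effectively placed on ‘False’.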
Neural network outputs are not
necessarily like True/False answers, or degree-of-confidence responses.
Indeed neural networks often have multiple outputs, in some cases large
arrays of outputs. However, the examples I have personally studied have
been mainly of the single-output type, where the output represented a
binary classification such as True/False or Win/Loss. Typically the
output is a number between 0 and 1: not a confidence level, as such, but
a value that evolves toward the limiting values 0 and 1 as ‘training’ progresses.
Neural network training consists of
running multiple learning trials (3-part cycles). In part 1, ‘forward
propagation’, the network consumes one or more inputs and produces an output,
somewhat analogous to the geography student’s percent-sure guess. In part 2
the teacher scores the network’s output, assigning an Error (also
called the Loss function) that reflects the degree to which the network
missed the mark on that particular training trial (formula above). Back
propagation also does something quasi-magical: it apportions
responsibility for the error among the neurons that make up the
network, not equally, but in relation to their individual contributions
to the error. Finally, the last of the three parts of each training cycle
updates the network by revising neuron weights and biases using the
computations of part 2, together with another overall parameter that is
appropriately named the Learning Rate. Of the three components that make up
each training cycle, two are easy to describe and understand, while one,
‘back propagation’, is more challenging (or was to me).
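The Error formula itself is not reproduced here; a common choice for a single-output network (an assumption on my part, not necessarily the exact formula the text refers to) is the squared difference between Expected and Output:

```python
def loss(output, expected):
    """Squared-error loss: how far the network's output
    missed the expected answer on one training trial."""
    return (output - expected) ** 2
```

The loss is 0 only when the output exactly matches the expected answer, and grows as the miss widens.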
Forward Propagation: In a perversely reductionist sense, the
simplest possible network (the ‘null’ network) would consist of input
and output only, with no middle or ‘hidden’ layers. Forward propagation
would perform a simple linear transformation of the input, up to activation.
Output activation might do nothing (identity function), or might
convert the transformed input to ‘0’ or ‘1’ (a step function), or could
activate the result in other ways. Realistically, though, any sort of
functional neural network would need to include at least one hidden
layer between input and output. As it happens, forward propagation is
the same across the entire network. Activated outputs of each layer
serve as inputs to the following layer, just as in the null-network example. Within a layer, each neuron’s
inputs are multiplied by the corresponding weights and these products
are summed. After adding bias, the output is passed through
the activation function. This activated output serves as an input to
the next layer.
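The per-layer computation just described (inputs times weights, summed, plus bias, then activated) can be sketched in Python, here with sigmoid activation; the function names are mine:

```python
import math

def sigmoid(x):
    """Squash a raw sum into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """Forward propagation through one dense layer: each neuron
    multiplies its inputs by its weights, sums the products,
    adds its bias, and activates the result."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        raw = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(sigmoid(raw))
    return outputs
```

The list returned here becomes the input list for the next layer, exactly as in the text.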
To reiterate, activation is the analog of a biological neuron’s
‘all-or-nothing’ attribute, and artificial neural network models implement activation in
a variety of ways. The sigmoid function is like an asymptotic step
function: it goes to 0 or 1 in the limit, but is never exactly 0 or 1.
Moreover, its derivative, which is needed in the back propagation step
(as will be explained),
can be efficiently computed when the function itself has already been
computed: σ'(x) = σ(x)(1 − σ(x)). [See this
page, for example.] Another common choice of activation
function is ReLU, which stands for ‘rectified linear unit’. The name is
more abstruse than the function itself, which simply maps an input to its
unmodified raw value if it is positive, and otherwise to 0. In other
words, ReLU(x) is x when x > 0,
and is 0 when x
≤ 0. This computationally simple
function is seen in many instructional exercises, as well as being
widely used in production applications.
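Both activations, and the sigmoid derivative identity quoted above, fit in a few lines of Python:

```python
import math

def sigmoid(x):
    """Asymptotic step function: approaches 0 and 1 but never reaches them."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(s):
    """Derivative of the sigmoid expressed in terms of its own
    already-computed output s = sigmoid(x): s * (1 - s)."""
    return s * (1.0 - s)

def relu(x):
    """Rectified linear unit: the raw value when positive, else 0."""
    return x if x > 0 else 0.0
```

Note that `sigmoid_derivative` takes the sigmoid's output, not its input, which is what makes the reuse efficient.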
Back Propagation: Back propagation is based on the ‘rate of
change’ concept. Imagine traveling along a road that has hills
and valleys. The steeper the hill, the greater will be the rate
of change in elevation. Going
uphill the change is positive (increasing elevation), and downhill it
is negative (decreasing elevation). The change in elevation is zero at
the top or bottom of a hill or when the road is flat.
Switching from hilly roads to neural
networks: with the latter it is possible to compute the rate of change
in the loss function (its slope) with respect to individual neuronal
weights and biases throughout the network. To do this you start at the
output end of the network and work backwards using a tool from
mathematics called ‘the chain rule’. It is an iterative
process, where the rate of change with respect to the raw output feeds
back to the penultimate network layer, and from there to the next
preceding layer, and so forth, all the way to the beginning of the network.
The rate of change computation depends
on the type of activation function. As explained above, the sigmoid
activation function scrunches the entire number line from -∞ to +∞ into
the interval (0, 1). It is not hard to guess how this particular
activation function would be useful in taming a neuron or network’s raw
output. In the graphs of the sigmoid function accompanying the
activation paragraph above, it is clear that the rate of change is
greatest at the middle (x = 0) and least at the asymptotes. Also,
recall that prior to training, the neural network is initialized with
random Gaussian weights (mean = 0). However, as the network learns to
classify inputs, corresponding activated outputs will tend toward
either ‘0’ or ‘1’
(the expected outputs) and the absolute error will decrease.
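For a single output neuron with sigmoid activation and squared-error loss (the loss is an assumption of mine for illustration), the chain rule works out as follows:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def weight_gradient(inp, weight, bias, expected):
    """Chain rule for one output neuron:
    dLoss/dw = (dLoss/dout) * (dout/dz) * (dz/dw),
    where z = weight * inp + bias and out = sigmoid(z)."""
    z = weight * inp + bias
    out = sigmoid(z)
    dloss_dout = 2.0 * (out - expected)   # derivative of (out - expected)**2
    dout_dz = out * (1.0 - out)           # sigmoid derivative
    dz_dw = inp                           # z changes by inp per unit of weight
    return dloss_dout * dout_dz * dz_dw
```

The same pattern repeats layer by layer: the factors computed at one layer feed back into the gradients of the preceding layer.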
Applying updates: Updating weights and biases is the third and
final part of each training cycle. As described above, the back
propagation process stores a gradient of partial rates of change for
each output and neuron in the network. When back propagation is
complete, the ‘update’ component of each learning cycle
applies these stored values to revise the network’s weights and biases.
The figure above illustrates the process for one selected weight in a
network that consists of three hidden layers of three neurons per
layer. Specifically the illustrated calculation is updating the third
weight of the first neuron in layer 2. This weight applies to the
activated output of the third neuron in the preceding layer. Similarly
updating of a bias is illustrated below.
Learning rate is a constant that controls the fineness or
coarseness of updating, hence the overall rate at which outputs change
over successive learning cycles. The value ranges from 0 to 1 and is
typically ascertained through trial and error, as are other network
training parameters. Bias
is a property of neurons, not inputs. Hence each neuron in the network
has a single associated bias. Only gradient and learning rate
participate in updating bias.
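The update rule itself is the simplest of the three parts. In Python form (names mine), for a weight or a bias:

```python
def update(value, gradient, learning_rate):
    """Gradient-descent update: step the weight (or bias) against
    its stored rate of change, scaled by the learning rate."""
    return value - learning_rate * gradient
```

Stepping against the gradient moves the loss downhill, which is the whole point of the hilly-road analogy above.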
Just as there are different
models of machine learning, so also people learn in different ways.
‘Learning by doing’ is one way in which to approach the challenge of
understanding a new concept. The method does not work for everyone—or
for every concept. Some ideas are too abstract or too broad in scope to
be tackled in this way. It would not be practical to reproduce at home
a scientific finding that was obtained using a million-dollar
instrument. Luckily, simple neural networks do not require such costly
technology. The following paragraphs summarize my
self-study to the time of this writing. The quest continues!
About four years ago (August
2017) I installed a TensorFlow based image classifier model on a
Raspberry Pi. This was my first direct exposure to a machine learning
application. Of course installing and exercising this
application did not require understanding of how it was designed or
‘trained’, or how it was able to do what it did.
The Python image classifier labeled
about 64,000 images on my computer. Most were photos with either serial
number or date-based file names, in other words uninformative names.
The purpose was to facilitate identifying and then locating photos or
other images whose file locations had been indexed in a database, and
for which I had previously created a browser-based user interface.5 Labels associated with the
selected image above were surprisingly
appropriate, and in general labels generated by classify_image.py
were better than having no meaningful labels for the
photos. However, a great many were bizarrely wrong.
Around the same time as I was experimenting
with the TensorFlow model for labeling images, my wife installed and
examined the Intel Neural Compute Stick 2 demo on an
Atomic Pi. Much later I attempted to implement the same or a
similar demo under Windows 10, but did not
succeed. These explorations were our first introduction to machine
learning applications. Although they were useful in the demonstration
sense, neither of us learned much from them—other than to persevere in
the struggle to make them work!
Most neural network instructional examples are coded in the Python
programming language. That may be due to
Python’s rich numerical and matrix assets (the NumPy
library, for example), or possibly it is because Python ranks first in
popularity among programming languages, according to https://pypl.github.io/PYPL.html.
My personal favorite language is not on the list! Of course, MUMPS is
singularly unsuited to neural network programming. On the other hand it is
possible to implement a ‘Hello, world!’ example in any language.
Image classification is a favored
example for demonstration exercises. More generally,
classification problems of various types are found
in introductory articles about neural networks. This makes
sense, in light of the hope or claim that artificial neural networks
should learn to
generalize beyond training data to additional examples not seen during
training. However, from the point of view of the student who seeks a
basic understanding of how neural
networks work, classification problems that rely on complex inputs seem
daunting. One wishes for a simpler starting point.
A thought that occurred to me was the
game of NIM. This is a paper and pencil game6
(or it can be played with physical
objects) that is of about the same complexity as
Tic-Tac-Toe. I confess that at first my thoughts erred on the grandiose
side, thinking by analogy to AlphaGo of a program that would learn NIM
by playing against itself. After reading many articles and their
accompanying illustrative exercises, I revised the
goal. Instead of learning to play the game, my program would learn to
classify game configurations as either winning or losing for the
player on move. Inputs would be simple 3-valued vectors and there would
be only one output. In 3, 5, 7 NIM, 190 game states or patterns are
possible between the start 3, 5, 7 and end 0, 0, 0 of the game (4 ∙ 6 ∙ 8 – 2).
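The 190 labels the network had to learn can be generated directly from the nim-sum rule given in note 6. A Python sketch (the original program was written in MUMPS; these names are mine):

```python
def is_loss_for_mover(piles):
    """Normal-form NIM: the position is a loss for the player on move
    exactly when the bitwise XOR (nim-sum) of the pile sizes is 0."""
    nim_sum = 0
    for n in piles:
        nim_sum ^= n
    return nim_sum == 0

# All 3, 5, 7 NIM states except the start (3, 5, 7) and the end (0, 0, 0).
states = [(a, b, c)
          for a in range(4) for b in range(6) for c in range(8)
          if (a, b, c) not in ((3, 5, 7), (0, 0, 0))]

# Label each state: 1 = win for the player on move, 0 = loss.
labels = {s: 0 if is_loss_for_mover(s) else 1 for s in states}
```

The state count comes to 4 · 6 · 8 − 2 = 190, matching the figure in the text.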
The artificial neural network that
learned the 190 win/loss patterns in 3, 5, 7 NIM consisted of 3 inputs
and 3 ‘hidden’ layers of 16 neurons per layer, plus one output (win or
loss). The network architecture shown above4
is not the first that I tried with NIM, but the first to produce 100%
‘learning’ of all game positions, and in a relatively short time frame
(roughly 20 million training trials). The programming language was MUMPS.
At the start of training, input weights
were initialized to random Gaussian values, with mean 0 and standard
deviation 1. Biases (not shown in the diagram) were initialized to 0. Training consisted of repeating many
training cycles. In each cycle a randomly selected game configuration
was presented as input. The network ‘guessed’ whether the pattern was a
win or a loss for the player on move—the forward propagation part.
After each such guess, feedback was propagated from the loss function
backwards through the network. After back propagation was complete,
neuron weights and biases were updated, and the cycle repeated with
another randomly selected input. In the training report
(right) the third parameter ‘0’ refers to an input
transform that was not used in this specific exercise. The last
parameter ‘.01’ is the criterion
for correct classification. The value .01 means that to be counted as
classifying a win, the sigmoid-activated output must be > 0.99. Similarly, to count
as classifying a loss the activated output must be < 0.01. This
value is not a probability, but can be thought of in a similar way.
The annotated listing above shows how a
different network (5 layers of 5 neurons each) learned to classify a
selected subset of training vectors, reflecting a balance of winning
and losing configurations, but otherwise randomly selected. In this
example, input vectors were transformed. However, I concluded
later that transforming inputs had no effect, except possibly to reduce arithmetic precision.
Surely MUMPS would rank near last among
languages one would choose for learning about artificial neural
networks. However, the trouble with Python is that it is too easy to
coast through a plethora of published examples and be persuaded of
understanding while not exerting sufficient original effort. My
personal study did draw on resources other than MUMPS, including many
published Python examples.7
Without them I would have been lost.
Once each part of the training cycle had
been coded and tested, I thought of making the MUMPS code self-documenting.
To that end, copies of the forward, back, and update functions were
modified to generate LaTeX documentation in place of calculating values. The trouble with this idea was that, except in the
case of the simplest test networks, the output was too voluminous to be
useful. For what it’s worth, the text of the illustrations
accompanying the ‘Applying Updates’ section above was generated in this way.
In general it is not possible to reverse
engineer a neural network in order to gain insight into how it does its
magic. Numeric values assigned to weights and biases by the
training process cannot be interpreted in terms of an external
conceptual model. The fact that model-level meaning is completely hidden
within the network is one of the features that distinguish this type of
machine learning from various other attempts to emulate human thinking
and decision making algorithmically.
In the past the
term ‘artificial intelligence’ (AI) was
associated with ideas that were not all that intelligent. However, over
time the term’s meaning has evolved to encompass a range
of specialized research endeavors that individually go by more specific
names. The study of artificial neural networks is one of many
currently active AI research areas. These paragraphs have explored only
the simplest of artificial neural networks. Advanced models currently rival
human capability for learning in specialized areas, and could one day
overtake us in learning about the world in general. If they do,
humans will be compelled to adapt to a new order, perhaps the ultimate
challenge of research in this area.
1. Go board image (public domain).
2. Thorndike image (public domain) from Wikipedia.
3. Some sources suggest glial
cells outnumber neurons in the
brain, but the evidence for this claim is unclear. Suffice it to say that
neurons are the most interesting of brain contents. They are what
defines a brain in the common functional sense.
4. The artificial neuron diagram
along with other similar illustrative diagrams on this page was
produced by this wonderfully flexible neural
network diagram tool, with annotations added in SnagIt.
5. In another page I described the
enterprise web development tool called EWD.js
that was used for this project.
6. The goal in the normal form of
NIM is to pick up the last remaining object (or cross off the last
mark); misère play is the opposite. At the start there are 3 piles of
objects (other starting configurations are possible). The most common
game starts with 3, 5, and 7 objects in
the three piles. Each play consists of picking up one or more objects
from one pile only. You cannot pick some from one pile and some from
another; that would not be a game! With perfect play, the first player wins.
Moreover, it is easy to classify each possible configuration as either
a win or loss for the player on move. If n1, n2, and n3
represent the number of objects in each pile, then
the game state is a loss for the player having the move if and only
if n1 ⊕ n2 ⊕ n3 = 0, where ⊕
stands for the bitwise XOR operator.
Since the order of piles has no bearing on the game (for
example, 0, n, n is the
same as n, n, 0),
human players have only to remember a few key patterns in order to play well.
7. Among the many
resources that facilitated this study, the following stand out: Neural
Networks from Scratch in Python (book by Harrison Kinsley
& Daniel Kukieła), a two-part ‘How to’ article by Steven Miller (part 1 and part 2), and another excellent ‘How to’ article with rich coding
examples by Jason Brownlee.
Project descriptions on this page are
intended for entertainment only.
The author makes no claim as to the accuracy or completeness of the
information presented. In no event will the author be liable for any
damages, lost effort, inability to carry out a similar project, or to
reproduce a claimed result, or anything else relating to a decision to
use the information on this page.