In the last chapter we saw how neural networks can
learn their weights and biases using the gradient descent algorithm.
There was, however, a gap in our explanation: we didn't discuss how to
compute the gradient of the cost function. That's quite a gap! In
this chapter I'll explain a fast algorithm for computing such
gradients, an algorithm known as backpropagation.
The backpropagation algorithm was originally introduced in the 1970s,
but its importance wasn't fully appreciated until a
famous
1986 paper by
David
Rumelhart,
Geoffrey
Hinton, and
Ronald
Williams. That paper describes several
neural networks where backpropagation works far faster than earlier
approaches to learning, making it possible to use neural nets to solve
problems which had previously been insoluble. Today, the
backpropagation algorithm is the workhorse of learning in neural
networks.
This chapter is more mathematically involved than the rest of the
book. If you're not crazy about mathematics you may be tempted to
skip the chapter, and to treat backpropagation as a black box whose
details you're willing to ignore. Why take the time to study those
details?
The reason, of course, is understanding. At the heart of
backpropagation is an expression for the partial derivative ∂C/∂w of the cost function C with respect to any weight
w (or bias b) in the network. The expression tells us how quickly
the cost changes when we change the weights and biases. And while the
expression is somewhat complex, it also has a beauty to it, with each
element having a natural, intuitive interpretation. And so
backpropagation isn't just a fast algorithm for learning. It actually
gives us detailed insights into how changing the weights and biases
changes the overall behaviour of the network. That's well worth
studying in detail.
With that said, if you want to skim the chapter, or jump straight to
the next chapter, that's fine. I've written the rest of the book to
be accessible even if you treat backpropagation as a black box. There
are, of course, points later in the book where I refer back to results
from this chapter. But at those points you should still be able to
understand the main conclusions, even if you don't follow all the
reasoning.
Before discussing backpropagation, let's warm up with a fast
matrix-based algorithm to compute the output from a neural network.
We actually already briefly saw this algorithm
near
the end of the last chapter, but I described it quickly, so it's
worth revisiting in detail. In particular, this is a good way of
getting comfortable with the notation used in backpropagation, in a
familiar context.
Let's begin with a notation which lets us refer to weights in the network in an unambiguous way. We'll use $w^l_{jk}$ to denote the weight for the connection from the $k^{\rm th}$ neuron in the $(l-1)^{\rm th}$ layer to the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. So, for example, the diagram below shows the weight on a connection from the fourth neuron in the second layer to the second neuron in the third layer of a network:
This notation is cumbersome at first, and it does take some work to
master. But with a little effort you'll find the notation becomes
easy and natural. One quirk of the notation is the ordering of the
j and
k indices. You might think that it makes more sense to use
j to refer to the input neuron, and
k to the output neuron, not
vice versa, as is actually done. I'll explain the reason for this
quirk below.
We use a similar notation for the network's biases and activations. Explicitly, we use $b^l_j$ for the bias of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. And we use $a^l_j$ for the activation of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. The following diagram shows examples of these notations in use:
With these notations, the activation $a^l_j$ of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer is related to the activations in the $(l-1)^{\rm th}$ layer by the equation (compare Equation (4) and surrounding discussion in the last chapter)
\begin{eqnarray}
  a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right), \tag{23}
\end{eqnarray}
where the sum is over all neurons $k$ in the $(l-1)^{\rm th}$ layer. To rewrite this expression in a matrix form we define a weight matrix $w^l$ for each layer, $l$. The entries of the weight matrix $w^l$ are just the weights connecting to the $l^{\rm th}$ layer of neurons, that is, the entry in the $j^{\rm th}$ row and $k^{\rm th}$ column is $w^l_{jk}$. Similarly, for each layer $l$ we define a bias vector, $b^l$. You can probably guess how this works - the components of the bias vector are just the values $b^l_j$, one component for each neuron in the $l^{\rm th}$ layer. And finally, we define an activation vector $a^l$ whose components are the activations $a^l_j$.
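To make the shapes concrete, here is how these objects might be represented as numpy arrays for one made-up pair of layers, with 3 neurons in layer $l-1$ and 2 neurons in layer $l$ (the variable names are purely illustrative):

import numpy as np

w_l = np.random.randn(2, 3)      # weight matrix w^l: entry w_l[j, k] is w^l_{jk}
b_l = np.random.randn(2, 1)      # bias vector b^l: entry b_l[j, 0] is b^l_j
a_prev = np.random.randn(3, 1)   # activation vector a^{l-1} for the previous layer

Note that the weight matrix has one row per neuron in layer $l$ and one column per neuron in layer $l-1$, matching the row-$j$, column-$k$ convention above.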
The last ingredient we need to rewrite (23) in a matrix form is the idea of vectorizing a function such as $\sigma$. We met vectorization briefly in the last chapter, but to recap, the idea is that we want to apply a function such as $\sigma$ to every element in a vector $v$. We use the obvious notation $\sigma(v)$ to denote this kind of elementwise application of a function. That is, the components of $\sigma(v)$ are just $\sigma(v)_j = \sigma(v_j)$. As an example, if we have the function $f(x) = x^2$ then the vectorized form of $f$ has the effect
\begin{eqnarray}
  f\left(\left[ \begin{array}{c} 2 \\ 3 \end{array} \right] \right)
  = \left[ \begin{array}{c} f(2) \\ f(3) \end{array} \right]
  = \left[ \begin{array}{c} 4 \\ 9 \end{array} \right], \tag{24}
\end{eqnarray}
that is, the vectorized $f$ just squares every element of the vector.
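In numpy this kind of vectorization comes for free, since functions built from numpy's elementwise operations broadcast over whole arrays. A small sketch:

import numpy as np

def sigmoid(z):
    # The sigmoid function, applied elementwise to the array z.
    return 1.0 / (1.0 + np.exp(-z))

v = np.array([2.0, 3.0])
print(v**2)         # [4. 9.]  -- the vectorized f(x) = x^2 of Equation (24)
print(sigmoid(v))   # sigma applied to every component of v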
With these notations in mind, Equation (23) can be rewritten in the beautiful and compact vectorized form
\begin{eqnarray}
  a^{l} = \sigma(w^l a^{l-1} + b^l). \tag{25}
\end{eqnarray}
This expression gives us a much more global way of thinking about how the activations in one layer relate to activations in the previous layer: we just apply the weight matrix to the activations, then add the bias vector, and finally apply the $\sigma$ function*
*By the way, it's this expression that motivates the quirk in the $w^l_{jk}$ notation mentioned earlier. If we used $j$ to index the input neuron, and $k$ to index the output neuron, then we'd need to replace the weight matrix in Equation (25) by the transpose of the weight matrix. That's a small change, but annoying, and we'd lose the easy simplicity of saying (and thinking) "apply the weight matrix to the activations".
That global view is often easier and more succinct (and involves fewer indices!) than the neuron-by-neuron view we've taken to now. Think of it as a way of escaping index hell, while remaining precise about what's going on. The expression is also useful in practice, because most matrix libraries provide fast ways of implementing matrix multiplication, vector addition, and vectorization. Indeed, the code in the last chapter made implicit use of this expression to compute the behaviour of the network.
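In code, Equation (25) is a single line. Reusing the arrays w_l, b_l, a_prev and the sigmoid function from the sketches above:

a_l = sigmoid(np.dot(w_l, a_prev) + b_l)   # a^l = sigma(w^l a^{l-1} + b^l)
# a_l has shape (2, 1): one activation per neuron in layer l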
When using Equation (25) to compute $a^l$, we compute the intermediate quantity $z^l \equiv w^l a^{l-1} + b^l$ along the way. This quantity turns out to be useful enough to be worth naming: we call $z^l$ the weighted input to the neurons in layer $l$. We'll make considerable use of the weighted input $z^l$ later in the chapter. Equation (25) is sometimes written in terms of the weighted input, as $a^l = \sigma(z^l)$. It's also worth noting that $z^l$ has components $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$, that is, $z^l_j$ is just the weighted input to the activation function for neuron $j$ in layer $l$.
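Putting the last two ideas together, a full forward pass might record each weighted input as it goes, which is exactly what backpropagation will need later. A sketch, assuming the network's weights and biases are stored as lists of numpy arrays, one entry per layer (as in the last chapter's program), and reusing the sigmoid function above; the helper name is mine:

def feedforward_with_zs(a, weights, biases):
    # Return (zs, activations): the weighted inputs z^l and activations a^l,
    # layer by layer, starting from the input activation a.
    zs, activations = [], [a]
    for w, b in zip(weights, biases):
        z = np.dot(w, a) + b        # the weighted input z^l
        zs.append(z)
        a = sigmoid(z)              # a^l = sigma(z^l), i.e. Equation (25)
        activations.append(a)
    return zs, activations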
The goal of backpropagation is to compute the partial derivatives
∂C/∂w and ∂C/∂b of the cost
function C with respect to any weight w or bias b in the
network. For backpropagation to work we need to make two main
assumptions about the form of the cost function. Before stating those
assumptions, though, it's useful to have an example cost function in
mind. We'll use the quadratic cost function from last chapter (c.f. Equation (6)). In the notation of the last section, the quadratic cost has the form
\begin{eqnarray}
  C = \frac{1}{2n} \sum_x \|y(x) - a^L(x)\|^2, \tag{26}
\end{eqnarray}
where: $n$ is the total number of training examples; the sum is over individual training examples, $x$; $y = y(x)$ is the corresponding desired output; $L$ denotes the number of layers in the network; and $a^L = a^L(x)$ is the vector of activations output from the network when $x$ is input.
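As code, Equation (26) is just an average of squared vector norms. A minimal sketch, where output_activations is a list of the network's outputs $a^L(x)$ and desired_outputs the matching list of vectors $y(x)$, both as numpy column vectors (the function name is mine, not from the last chapter's program):

import numpy as np

def quadratic_cost(output_activations, desired_outputs):
    # Equation (26): the quadratic cost, averaged over the n training examples.
    n = len(output_activations)
    return sum(0.5 * np.linalg.norm(y - a_L)**2
               for a_L, y in zip(output_activations, desired_outputs)) / n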
Okay, so what assumptions do we need to make about our cost function, $C$, in order that backpropagation can be applied? The first assumption we need is that the cost function can be written as an average $C = \frac{1}{n} \sum_x C_x$ over cost functions $C_x$ for individual training examples, $x$. This is the case for the quadratic cost function, where the cost for a single training example is $C_x = \frac{1}{2} \|y - a^L\|^2$. This assumption will also hold true for all the other cost functions we'll meet in this book.
The reason we need this assumption is because what backpropagation actually lets us do is compute the partial derivatives $\partial C_x / \partial w$ and $\partial C_x / \partial b$ for a single training example. We then recover $\partial C / \partial w$ and $\partial C / \partial b$ by averaging over training examples. In fact, with this assumption in mind, we'll suppose the training example $x$ has been fixed, and drop the $x$ subscript, writing the cost $C_x$ as $C$. We'll eventually put the $x$ back in, but for now it's a notational nuisance that is better left implicit.
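In code, this assumption shows up as nothing more than an accumulate-and-average step: compute the gradients for each training example, then average. A schematic sketch, in which backprop_single is a hypothetical routine returning the per-example gradients $\partial C_x / \partial b$ and $\partial C_x / \partial w$ as lists of numpy arrays shaped like biases and weights:

nabla_b = [np.zeros(b.shape) for b in biases]
nabla_w = [np.zeros(w.shape) for w in weights]
for x, y in training_data:
    nabla_b_x, nabla_w_x = backprop_single(x, y)   # gradients for one example
    nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, nabla_b_x)]
    nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, nabla_w_x)]
n = len(training_data)
nabla_b = [nb / n for nb in nabla_b]   # dC/db, averaged over the examples
nabla_w = [nw / n for nw in nabla_w]   # dC/dw, averaged over the examples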
The second assumption we make about the cost is that it can be written
as a function of the outputs from the neural network:
For example, the quadratic cost function satisfies this requirement, since the quadratic cost for a single training example $x$ may be written as
\begin{eqnarray}
  C = \frac{1}{2} \|y - a^L\|^2 = \frac{1}{2} \sum_j (y_j - a^L_j)^2, \tag{27}
\end{eqnarray}
and thus is a function of the output activations. Of course, this cost function also depends on the desired output $y$, and you may wonder why we're not regarding the cost also as a function of $y$. Remember, though, that the input training example $x$ is fixed, and so the output $y$ is also a fixed parameter. In particular, it's not something we can modify by changing the weights and biases in any way, i.e., it's not something which the neural network learns. And so it makes sense to regard $C$ as a function of the output activations $a^L$ alone, with $y$ merely a parameter that helps define that function.
The backpropagation algorithm is based on common linear algebraic
operations - things like vector addition, multiplying a vector by a
matrix, and so on. But one of the operations is a little less
commonly used. In particular, suppose $s$ and $t$ are two vectors of the same dimension. Then we use $s \odot t$ to denote the elementwise product of the two vectors. Thus the components of $s \odot t$ are just $(s \odot t)_j = s_j t_j$. As an example,
\begin{eqnarray}
  \left[\begin{array}{c} 1 \\ 2 \end{array}\right]
  \odot \left[\begin{array}{c} 3 \\ 4 \end{array} \right]
  = \left[ \begin{array}{c} 1 * 3 \\ 2 * 4 \end{array} \right]
  = \left[ \begin{array}{c} 3 \\ 8 \end{array} \right]. \tag{28}
\end{eqnarray}
This kind of elementwise multiplication is sometimes called the Hadamard product or Schur product. We'll refer to it as
the Hadamard product. Good matrix libraries usually provide fast
implementations of the Hadamard product, and that comes in handy when
implementing backpropagation.
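In numpy, no special function is needed: the ordinary * operator on two arrays of the same shape multiplies them elementwise. A quick check of Equation (28):

import numpy as np

s = np.array([1, 2])
t = np.array([3, 4])
print(s * t)    # [3 8] -- the Hadamard product s ⊙ t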
Backpropagation is about understanding how changing the weights and biases in a network changes the cost function. Ultimately, this means computing the partial derivatives $\partial C / \partial w^l_{jk}$ and $\partial C / \partial b^l_j$. But to compute those, we first introduce an intermediate quantity, $\delta^l_j$, which we call the error in the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. Backpropagation will give us a procedure to compute the error $\delta^l_j$, and then will relate $\delta^l_j$ to $\partial C / \partial w^l_{jk}$ and $\partial C / \partial b^l_j$.
To understand how the error is defined, imagine there is a demon in
our neural network:
The demon sits at the $j^{\rm th}$ neuron in layer $l$. As the input to the neuron comes in, the demon messes with the neuron's operation. It adds a little change $\Delta z^l_j$ to the neuron's weighted input, so that instead of outputting $\sigma(z^l_j)$, the neuron instead outputs $\sigma(z^l_j + \Delta z^l_j)$. This change propagates through later layers in the network, finally causing the overall cost to change by an amount $\frac{\partial C}{\partial z^l_j} \Delta z^l_j$.
Now, this demon is a good demon, and is trying to help you improve the cost, i.e., they're trying to find a $\Delta z^l_j$ which makes the cost smaller. Suppose $\frac{\partial C}{\partial z^l_j}$ has a large value (either positive or negative). Then the demon can lower the cost quite a bit by choosing $\Delta z^l_j$ to have the opposite sign to $\frac{\partial C}{\partial z^l_j}$. By contrast, if $\frac{\partial C}{\partial z^l_j}$ is close to zero, then the demon can't improve the cost much at all by perturbing the weighted input $z^l_j$. So far as the demon can tell, the neuron is already pretty near optimal*
*This is only the case for small changes $\Delta z^l_j$, of course. We'll assume that the demon is constrained to make such small changes.
And so there's a heuristic sense in which $\frac{\partial C}{\partial z^l_j}$ is a measure of the error in the neuron.
Motivated by this story, we define the error $\delta^l_j$ of neuron $j$ in layer $l$ by
\begin{eqnarray}
  \delta^l_j \equiv \frac{\partial C}{\partial z^l_j}. \tag{29}
\end{eqnarray}
As per our usual conventions, we use $\delta^l$ to denote the vector of errors associated with layer $l$. Backpropagation will give us a way of computing $\delta^l$ for every layer, and then relating those errors to the quantities of real interest, $\partial C / \partial w^l_{jk}$ and $\partial C / \partial b^l_j$.
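The definition (29) is easy to sanity-check numerically. The toy sketch below uses a single sigmoid output neuron with a made-up target $y$, treats the cost as a function of that neuron's weighted input $z$, and estimates $\partial C / \partial z$ by a finite difference; all the values are purely illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 0.0    # made-up desired output for this toy neuron

def cost(z):
    # Quadratic cost of a single sigmoid neuron, as a function of its weighted input z.
    return 0.5 * (y - sigmoid(z))**2

z, eps = 1.5, 1e-6
delta = (cost(z + eps) - cost(z - eps)) / (2 * eps)   # numerical estimate of dC/dz
print(delta)   # the "error" of this neuron, in the sense of Equation (29)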
You might wonder why the demon is changing the weighted input $z^l_j$. Surely it'd be more natural to imagine the demon changing the output activation $a^l_j$, with the result that we'd be using $\frac{\partial C}{\partial a^l_j}$ as our measure of error. In fact, if you do this things work out quite similarly to the discussion below. But it turns out to make the presentation of backpropagation a little more algebraically complicated. So we'll stick with $\delta^l_j = \frac{\partial C}{\partial z^l_j}$ as our measure of error*
*In classification problems like MNIST the term "error" is sometimes used to mean the classification failure rate. E.g., if the neural net correctly classifies 96.0 percent of the digits, then the error is 4.0 percent. Obviously, this has quite a different meaning from our $\delta$ vectors. In practice, you shouldn't have trouble telling which meaning is intended in any given usage.
Plan of attack: Backpropagation is based around four
fundamental equations. Together, those equations give us a way of
computing both the error $\delta^l$ and the gradient of the cost
function. I state the four equations below. Be warned, though: you
shouldn't expect to instantaneously assimilate the equations. Such an
expectation will lead to disappointment. In fact, the backpropagation
equations are so rich that understanding them well requires
considerable time and patience as you gradually delve deeper into the
equations. The good news is that such patience is repaid many times
over. And so the discussion in this section is merely a beginning,
helping you on the way to a thorough understanding of the equations.
Here's a preview of the ways we'll delve more deeply into the
equations later in the chapter: I'll
give
a short proof of the equations, which helps explain why they are
true; we'll restate
the equations in algorithmic form as pseudocode, and
see how the
pseudocode can be implemented as real, running Python code; and, in
the final
section of the chapter, we'll develop an intuitive picture of what
the backpropagation equations mean, and how someone might discover
them from scratch. Along the way we'll return repeatedly to the four
fundamental equations, and as you deepen your understanding those
equations will come to seem comfortable and, perhaps, even beautiful
and natural.
An equation for the error in the output layer, $\delta^L$: The components of $\delta^L$ are given by
\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j). \tag{BP1}
\end{eqnarray}
This is a very natural expression. The first term on the right, $\partial C / \partial a^L_j$, just measures how fast the cost is changing as a function of the $j^{\rm th}$ output activation. If, for example, $C$ doesn't depend much on a particular output neuron, $j$, then $\delta^L_j$ will be small, which is what we'd expect. The second term on the right, $\sigma'(z^L_j)$, measures how fast the activation function $\sigma$ is changing at $z^L_j$.
Notice that everything in (BP1) is easily computed. In particular, we compute $z^L_j$ while computing the behaviour of the network, and it's only a small additional overhead to compute $\sigma'(z^L_j)$. The exact form of $\partial C / \partial a^L_j$ will, of course, depend on the form of the cost function. However, provided the cost function is known there should be little trouble computing $\partial C / \partial a^L_j$. For example, if we're using the quadratic cost function then $C = \frac{1}{2} \sum_j (y_j - a_j)^2$, and so $\partial C / \partial a^L_j = (a_j - y_j)$, which obviously is easily computable.
Equation (BP1) is a componentwise expression for $\delta^L$. It's a perfectly good expression, but not the matrix-based form we want for backpropagation. However, it's easy to rewrite the equation in a matrix-based form, as
\begin{eqnarray}
  \delta^L = \nabla_a C \odot \sigma'(z^L). \tag{BP1a}
\end{eqnarray}
Here, $\nabla_a C$ is defined to be a vector whose components are the partial derivatives $\partial C / \partial a^L_j$. You can think of $\nabla_a C$ as expressing the rate of change of $C$ with respect to the output activations. It's easy to see that Equations (BP1a) and (BP1)
are equivalent, and for that reason from now on we'll use (BP1) interchangeably to refer to both equations.
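For the quadratic cost, (BP1a) becomes a one-liner in numpy, with the * operator playing the role of the Hadamard product $\odot$. A minimal sketch, reusing the sigmoid function from earlier (the helper names are mine):

def sigmoid_prime(z):
    # Derivative of the sigmoid function, elementwise.
    return sigmoid(z) * (1 - sigmoid(z))

def delta_output_layer(a_L, y, z_L):
    # Equation (BP1a) for the quadratic cost: nabla_a C = (a^L - y),
    # so delta^L = (a^L - y) ⊙ sigma'(z^L).
    return (a_L - y) * sigmoid_prime(z_L)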