Neural Networks and Deep Learning

How the backpropagation algorithm works

In the last chapter we saw how neural networks can
learn their weights and biases using the gradient descent algorithm.
There was, however, a gap in our explanation: we didn't discuss how to
compute the gradient of the cost function. That's quite a gap! In
this chapter I'll explain a fast algorithm for computing such
gradients, an algorithm known as *backpropagation*.

The backpropagation algorithm was originally introduced in the 1970s, but its importance wasn't fully appreciated until a famous 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams. That paper describes several neural networks where backpropagation works far faster than earlier approaches to learning, making it possible to use neural nets to solve problems which had previously been insoluble. Today, the backpropagation algorithm is the workhorse of learning in neural networks.

This chapter is more mathematically involved than the rest of the book. If you're not crazy about mathematics you may be tempted to skip the chapter, and to treat backpropagation as a black box whose details you're willing to ignore. Why take the time to study those details?

The reason, of course, is understanding. At the heart of backpropagation is an expression for the partial derivative $\partial C / \partial w$ of the cost function $C$ with respect to any weight $w$ (or bias $b$) in the network. The expression tells us how quickly the cost changes when we change the weights and biases. And while the expression is somewhat complex, it also has a beauty to it, with each element having a natural, intuitive interpretation. And so backpropagation isn't just a fast algorithm for learning. It actually gives us detailed insights into how changing the weights and biases changes the overall behaviour of the network. That's well worth studying in detail.

With that said, if you want to skim the chapter, or jump straight to the next chapter, that's fine. I've written the rest of the book to be accessible even if you treat backpropagation as a black box. There are, of course, points later in the book where I refer back to results from this chapter. But at those points you should still be able to understand the main conclusions, even if you don't follow all the reasoning.

Before discussing backpropagation, let's warm up with a fast matrix-based algorithm to compute the output from a neural network. We actually already briefly saw this algorithm near the end of the last chapter, but I described it quickly, so it's worth revisiting in detail. In particular, this is a good way of getting comfortable with the notation used in backpropagation, in a familiar context.

Let's begin with a notation which lets us refer to weights in the network in an unambiguous way. We'll use $w^l_{jk}$ to denote the weight for the connection from the $k^{\rm th}$ neuron in the $(l-1)^{\rm th}$ layer to the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer.
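As a concrete sketch (my own illustration, not from the text): if we store the layer-$l$ weights as a matrix $w^l$ whose entry in row $j$, column $k$ is $w^l_{jk}$, then looking up a weight is just a matrix index. Note that NumPy arrays are 0-indexed while the notation here is 1-indexed:

```python
import numpy as np

# Hypothetical network: 2 inputs, 3 hidden neurons, 1 output.
# weights[i] is the matrix w^l for layer l = i + 2; its (j, k) entry
# is w^l_{jk}, the weight from neuron k of layer l-1 to neuron j of layer l.
rng = np.random.default_rng(0)
sizes = [2, 3, 1]
weights = [rng.standard_normal((y, x)) for x, y in zip(sizes[:-1], sizes[1:])]

W2 = weights[0]           # weights into the hidden layer (layer 2)
assert W2.shape == (3, 2)  # 3 neurons in layer 2, 2 neurons in layer 1
w_2_31 = W2[2, 0]          # w^2_{31}: neuron 1 of layer 1 -> neuron 3 of layer 2
```

The slightly backwards-looking $j,k$ ordering is what makes the matrix form in Equation (25) work without transposing.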

We use a similar notation for the network's biases and activations. Explicitly, we use $b^l_j$ for the bias of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer, and $a^l_j$ for the activation of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. With these notations, the activation $a^l_j$ is related to the activations in the $(l-1)^{\rm th}$ layer by the equation

$$a^l_j = \sigma\left(\sum_k w^l_{jk} a^{l-1}_k + b^l_j\right), \tag{23}$$

where the sum is over all neurons $k$ in the $(l-1)^{\rm th}$ layer.

The last ingredient we need to rewrite (23) in a matrix form is the idea of vectorizing a function such as $\sigma$. The idea is that we apply the function to every element of a vector: in components, $\sigma(v)_j = \sigma(v_j)$.

With these notations in mind, Equation (23) can be rewritten in the beautiful and compact vectorized form

$$a^l = \sigma(w^l a^{l-1} + b^l). \tag{25}$$

When using Equation (25) to compute $a^l$, we compute the intermediate quantity $z^l \equiv w^l a^{l-1} + b^l$ along the way. We call $z^l$ the *weighted input* to the neurons in layer $l$. It has components $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$, so $z^l_j$ is just the weighted input to the activation function for neuron $j$ in layer $l$, and Equation (25) is sometimes written as $a^l = \sigma(z^l)$.

The goal of backpropagation is to compute the partial derivatives $\partial C / \partial w$ and $\partial C / \partial b$ of the cost function $C$ with respect to any weight $w$ or bias $b$ in the network.
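Applying Equation (25) layer by layer gives the network's output. A minimal sketch (function and variable names are my own, not from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    """Return the network output by applying a^l = sigmoid(w^l a^{l-1} + b^l),
    i.e. Equation (25), one layer at a time."""
    for w, b in zip(weights, biases):
        z = w @ a + b      # weighted input z^l
        a = sigmoid(z)     # activation a^l
    return a

# Tiny example: 2 inputs -> 3 hidden neurons -> 1 output.
rng = np.random.default_rng(1)
sizes = [2, 3, 1]
weights = [rng.standard_normal((y, x)) for x, y in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((y, 1)) for y in sizes[1:]]

out = feedforward(np.array([[0.5], [0.2]]), weights, biases)
assert out.shape == (1, 1)
assert 0.0 < out[0, 0] < 1.0   # sigmoid outputs lie in (0, 1)
```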

Okay, so what assumptions do we need to make about our cost function, $C$, in order that backpropagation can be applied? The first assumption we need is that the cost function can be written as an average $C = \frac{1}{n} \sum_x C_x$ over cost functions $C_x$ for individual training examples, $x$.

The reason we need this assumption is because what backpropagation actually lets us do is compute the partial derivatives $\partial C_x / \partial w$ and $\partial C_x / \partial b$ for a single training example. We then recover $\partial C / \partial w$ and $\partial C / \partial b$ by averaging over training examples. With this assumption in mind, we'll suppose the training example $x$ has been fixed, and drop the $x$ subscript, writing the cost $C_x$ as $C$.

The second assumption we make about the cost is that it can be written as a function of the outputs from the neural network, $C = C(a^L)$, where $L$ denotes the number of layers in the network. For example, the quadratic cost satisfies this requirement, since for a single training example $x$ it may be written as $C = \frac{1}{2} \|y - a^L\|^2 = \frac{1}{2} \sum_j (y_j - a^L_j)^2$, and thus is a function of the output activations.
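A minimal sketch of that per-example quadratic cost (the function name is my own):

```python
import numpy as np

def quadratic_cost(a_L, y):
    """Per-example quadratic cost C = 0.5 * ||y - a^L||^2.
    It depends only on the network's output activations a^L
    (y is fixed by the training example, not by the network)."""
    return 0.5 * np.sum((y - a_L) ** 2)

a_L = np.array([0.8, 0.1])   # hypothetical output activations
y = np.array([1.0, 0.0])     # desired output
c = quadratic_cost(a_L, y)   # 0.5 * (0.2^2 + 0.1^2) ≈ 0.025
```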

The backpropagation algorithm is based on common linear algebraic operations - things like vector addition, multiplying a vector by a matrix, and so on. But one of the operations is a little less commonly used. In particular, suppose $s$ and $t$ are two vectors of the same dimension. Then we use $s \odot t$ to denote the *elementwise* product of the two vectors. Thus the components of $s \odot t$ are just $(s \odot t)_j = s_j t_j$. This kind of elementwise multiplication is sometimes called the *Hadamard product* or *Schur product*.
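In NumPy, for instance, the `*` operator on equal-shaped arrays is exactly this elementwise product:

```python
import numpy as np

s = np.array([1, 2])
t = np.array([3, 4])
print(s * t)  # [3 8], since (s ⊙ t)_j = s_j * t_j
```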

Backpropagation is about understanding how changing the weights and biases in a network changes the cost function. Ultimately, this means computing the partial derivatives $\partial C / \partial w^l_{jk}$ and $\partial C / \partial b^l_j$. But to compute those, we first introduce an intermediate quantity, $\delta^l_j$, which we call the *error* in the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. Backpropagation will give us a procedure to compute the error $\delta^l_j$, and then will relate $\delta^l_j$ to $\partial C / \partial w^l_{jk}$ and $\partial C / \partial b^l_j$.

To understand how the error is defined, imagine there is a demon in our neural network. The demon sits at the $j^{\rm th}$ neuron in layer $l$. As the input to the neuron comes in, the demon messes with the neuron's operation: it adds a little change $\Delta z^l_j$ to the neuron's weighted input, so that instead of outputting $\sigma(z^l_j)$, the neuron outputs $\sigma(z^l_j + \Delta z^l_j)$. This change propagates through later layers in the network, finally causing the overall cost to change by an amount $\frac{\partial C}{\partial z^l_j} \Delta z^l_j$.

Now, this demon is a good demon, and is trying to help you improve the cost, i.e., they're trying to find a $\Delta z^l_j$ which makes the cost smaller. Suppose $\frac{\partial C}{\partial z^l_j}$ has a large value (either positive or negative). Then the demon can lower the cost quite a bit by choosing $\Delta z^l_j$ to have the opposite sign to $\frac{\partial C}{\partial z^l_j}$. By contrast, if $\frac{\partial C}{\partial z^l_j}$ is close to zero, then the demon can't improve the cost much at all by perturbing the weighted input $z^l_j$; so far as the demon can tell, the neuron is already pretty near optimal. And so there's a heuristic sense in which $\frac{\partial C}{\partial z^l_j}$ is a measure of the error in the neuron.

Motivated by this story, we define the error $\delta^l_j$ of neuron $j$ in layer $l$ by

$$\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}.$$

As per our usual conventions, we use $\delta^l$ to denote the vector of errors associated with layer $l$.

You might wonder why the demon is changing the weighted input $z^l_j$. Surely it'd be more natural to imagine the demon changing the output activation $a^l_j$, with the result that we'd be using $\frac{\partial C}{\partial a^l_j}$ as our measure of error. In fact, if you do this things work out quite similarly to the discussion below, but it turns out to make the presentation of backpropagation a little more algebraically complicated. So we'll stick with $\delta^l_j = \frac{\partial C}{\partial z^l_j}$ as our measure of error.

**Plan of attack:** Backpropagation is based around four fundamental equations. Together, those equations give us a way of computing both the error $\delta^l$ and the gradient of the cost function.

Here's a preview of the ways we'll delve more deeply into the equations later in the chapter: I'll give a short proof of the equations, which helps explain why they are true; we'll restate the equations in algorithmic form as pseudocode, and see how the pseudocode can be implemented as real, running Python code; and, in the final section of the chapter, we'll develop an intuitive picture of what the backpropagation equations mean, and how someone might discover them from scratch. Along the way we'll return repeatedly to the four fundamental equations, and as you deepen your understanding those equations will come to seem comfortable and, perhaps, even beautiful and natural.

**An equation for the error in the output layer, δL:**
The components of $\delta^L$ are given by

$$\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j). \tag{BP1}$$

This is a very natural expression. The first term, $\partial C / \partial a^L_j$, measures how fast the cost is changing as a function of the $j^{\rm th}$ output activation. The second term, $\sigma'(z^L_j)$, measures how fast the activation function $\sigma$ is changing at $z^L_j$.

Notice that everything in (BP1) is easily computed. In particular, we compute $z^L_j$ while computing the behaviour of the network, and it's only a small additional overhead to compute $\sigma'(z^L_j)$. The exact form of $\partial C / \partial a^L_j$ will, of course, depend on the form of the cost function. However, provided the cost function is known there should be little trouble computing it. For example, if we're using the quadratic cost, then $C = \frac{1}{2} \sum_j (y_j - a^L_j)^2$, and so $\partial C / \partial a^L_j = (a^L_j - y_j)$, which obviously is easily computable.

Equation (BP1) is a componentwise expression for $\delta^L$. It's a perfectly good expression, but not the matrix-based form we want for backpropagation. However, it's easy to rewrite the equation in a matrix-based form, as

$$\delta^L = \nabla_a C \odot \sigma'(z^L), \tag{BP1a}$$

where $\nabla_a C$ is defined to be the vector whose components are the partial derivatives $\partial C / \partial a^L_j$.
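As a sketch (my own code, assuming the quadratic cost, so $\nabla_a C = a^L - y$), (BP1a) can be evaluated directly and checked against a finite-difference estimate of $\partial C / \partial z^L_j$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Output layer: weighted inputs z^L and desired output y.
z_L = np.array([0.5, -1.2])
y = np.array([1.0, 0.0])
a_L = sigmoid(z_L)

# (BP1a) with the quadratic cost: delta^L = (a^L - y) ⊙ sigma'(z^L)
delta_L = (a_L - y) * sigmoid_prime(z_L)

# Check delta^L_j = dC/dz^L_j numerically, via central differences.
def cost(z):
    return 0.5 * np.sum((y - sigmoid(z)) ** 2)

eps = 1e-6
for j in range(len(z_L)):
    dz = np.zeros_like(z_L)
    dz[j] = eps
    numeric = (cost(z_L + dz) - cost(z_L - dz)) / (2 * eps)
    assert abs(numeric - delta_L[j]) < 1e-8
```

The check works because $\delta^L_j$ is, by definition, just the derivative of the cost with respect to the weighted input $z^L_j$.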