## ch-2 How the backpropagation algorithm works

This chapter discusses **how to compute** the gradient $\nabla C$ of the cost function.

Heart of backpropagation

- an expression for the partial derivative $\partial C / \partial w$ of the cost function $C$ with respect to any weight $w$ (or bias $b$) in the network
- tells us **how quickly** the cost changes when we change the weights and biases

### Warm up: a fast matrix-based approach to computing the output from a neural network

Notation

Vectorized form

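- apply the weight matrix to the activations, then add the bias vector, and finally apply the $\sigma$ function elementwise
- $a^l = \sigma(w^l a^{l-1} + b^l) = \sigma(z^l)$, where $z^l = w^l a^{l-1} + b^l$ is the weighted input to layer $l$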
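The vectorized rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the 2-3-1 layer sizes, the random parameters, and the names `sigmoid` and `feedforward` are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    """Return the output activation a^L for an input column vector a.

    weights[i] has shape (n_out, n_in) and biases[i] has shape (n_out, 1),
    so each layer computes a^l = sigma(w^l a^{l-1} + b^l).
    """
    for w, b in zip(weights, biases):
        z = w @ a + b      # weighted input z^l
        a = sigmoid(z)     # activation a^l
    return a

# Hypothetical 2-3-1 network with random parameters (illustration only).
rng = np.random.default_rng(0)
sizes = [2, 3, 1]
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((n_out, 1)) for n_out in sizes[1:]]

x = np.array([[0.5], [-1.0]])          # input as a column vector
output = feedforward(x, weights, biases)
```

Keeping activations as column vectors means one matrix product per layer replaces the explicit loop over neurons.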

### The two assumptions we need about the cost function

- the cost function can be written as an average $C = \frac{1}{n} \sum_x C_x$ over cost functions $C_x$ for individual training examples $x$
- the cost can be written as a **function of the outputs** from the neural network: $C = C(a^L)$
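As a concrete case, the quadratic cost satisfies both assumptions. The sketch below assumes that cost (the notes do not fix a particular one) and uses made-up outputs and targets; `cost_single` and `cost_total` are hypothetical helper names.

```python
import numpy as np

def cost_single(a_L, y):
    """Quadratic cost C_x = 0.5 * ||y - a^L||^2 for one training example.

    It depends on the network only through its output a^L (assumption 2).
    """
    return 0.5 * np.sum((y - a_L) ** 2)

def cost_total(outputs, targets):
    """Average of the per-example costs: C = (1/n) * sum_x C_x (assumption 1)."""
    per_example = [cost_single(a, y) for a, y in zip(outputs, targets)]
    return sum(per_example) / len(per_example)

# Made-up network outputs and desired outputs for two training examples.
outputs = [np.array([[0.8]]), np.array([[0.2]])]
targets = [np.array([[1.0]]), np.array([[0.0]])]
total = cost_total(outputs, targets)
```

The first assumption matters because backpropagation computes the gradient of a single $C_x$; the gradient of $C$ is then recovered by averaging over training examples.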

### The four fundamental equations behind backpropagation

**Definition of the error** $\delta^l_j$ of neuron $j$ in layer $l$

**Error in the output layer** $\delta^L$

**Error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$**

**Rate of change of the cost with respect to any bias**

**Rate of change of the cost with respect to any weight**
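For reference, the items above are usually written as follows. This assumes the weighted input $z^l = w^l a^{l-1} + b^l$, with $\odot$ denoting the elementwise (Hadamard) product and $\nabla_a C$ the vector of partial derivatives $\partial C / \partial a^L_j$; the labels BP1 to BP4 are a common naming convention, not something these notes define.

$$
\begin{aligned}
\delta^l_j &\equiv \frac{\partial C}{\partial z^l_j} && \text{(definition of the error)} \\
\delta^L &= \nabla_a C \odot \sigma'(z^L) && \text{(BP1)} \\
\delta^l &= \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l) && \text{(BP2)} \\
\frac{\partial C}{\partial b^l_j} &= \delta^l_j && \text{(BP3)} \\
\frac{\partial C}{\partial w^l_{jk}} &= a^{l-1}_k \, \delta^l_j && \text{(BP4)}
\end{aligned}
$$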

### Proof of the four fundamental equations

All four equations follow from the chain rule of multivariable calculus.
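A sketch of the argument for the first two equations, assuming the error is defined as $\delta^l_j \equiv \partial C / \partial z^l_j$ with $a^l_j = \sigma(z^l_j)$ and $z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j + b^{l+1}_k$. For the output layer, $a^L_k$ depends on $z^L_j$ only when $k = j$, so the sum collapses to a single term:

$$
\delta^L_j = \frac{\partial C}{\partial z^L_j}
= \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j}
= \frac{\partial C}{\partial a^L_j} \, \sigma'(z^L_j)
$$

For an earlier layer, the cost depends on $z^l_j$ only through the weighted inputs of the next layer:

$$
\delta^l_j = \frac{\partial C}{\partial z^l_j}
= \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j}
= \sum_k \delta^{l+1}_k \, w^{l+1}_{kj} \, \sigma'(z^l_j)
$$

The bias and weight equations follow the same pattern, using $\partial z^l_j / \partial b^l_j = 1$ and $\partial z^l_j / \partial w^l_{jk} = a^{l-1}_k$.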

### The backpropagation algorithm

- why the error is computed backward, starting from the final layer: the cost is a function of the network's outputs
- to understand how the cost varies with earlier weights and biases, we need to repeatedly apply the chain rule, working backward through the layers to obtain usable expressions
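The backward sweep can be sketched as below for a single training example. This is an illustrative sketch under stated assumptions, not a definitive implementation: it assumes sigmoid neurons and the quadratic cost $C_x = \frac{1}{2} \lVert a^L - y \rVert^2$, and the 2-3-1 network, its random parameters, and the function names are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, weights, biases):
    """Return (grad_w, grad_b), the gradient of the quadratic cost C_x."""
    # Forward pass: store every weighted input z^l and activation a^l.
    activation = x
    activations = [x]
    zs = []
    for w, b in zip(weights, biases):
        z = w @ activation + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)

    grad_w = [np.zeros_like(w) for w in weights]
    grad_b = [np.zeros_like(b) for b in biases]

    # Output error: delta^L = (a^L - y) * sigma'(z^L) for the quadratic cost.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_b[-1] = delta
    grad_w[-1] = delta @ activations[-2].T

    # Backward sweep: delta^l = ((w^{l+1})^T delta^{l+1}) * sigma'(z^l).
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grad_b[-l] = delta                          # dC/db^l = delta^l
        grad_w[-l] = delta @ activations[-l - 1].T  # dC/dw^l = delta^l (a^{l-1})^T
    return grad_w, grad_b

# Hypothetical 2-3-1 network and one made-up training example.
rng = np.random.default_rng(1)
sizes = [2, 3, 1]
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((n_out, 1)) for n_out in sizes[1:]]
x = np.array([[0.5], [-1.0]])
y = np.array([[1.0]])
grad_w, grad_b = backprop(x, y, weights, biases)
```

One forward pass and one backward pass yield every partial derivative at once. A useful sanity check is to compare a few entries against a finite-difference estimate of the cost.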

### Backpropagation: the big picture

Backpropagation is a clever way of keeping track of small perturbations to the weights (and biases) as they propagate through the network, reach the output, and then affect the cost.
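In symbols, and assuming the usual indexing in which $w^l_{jk}$ connects neuron $k$ in layer $l-1$ to neuron $j$ in layer $l$, a small change $\Delta w^l_{jk}$ changes the cost by approximately

$$
\Delta C \approx \frac{\partial C}{\partial w^l_{jk}} \, \Delta w^l_{jk}
$$

Tracing that perturbation along every path from the weight to the output gives the derivative as a sum over paths, with one rate factor per step along each path:

$$
\frac{\partial C}{\partial w^l_{jk}}
= \sum_{m,n,p,\ldots,q}
\frac{\partial C}{\partial a^L_m}
\frac{\partial a^L_m}{\partial a^{L-1}_n}
\frac{\partial a^{L-1}_n}{\partial a^{L-2}_p}
\cdots
\frac{\partial a^{l+1}_q}{\partial a^l_j}
\frac{\partial a^l_j}{\partial w^l_{jk}}
$$

Backpropagation can be read as an efficient way of evaluating this sum without enumerating the paths one by one.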