## ch-1 Using neural nets to recognize handwritten digits

### Perceptrons

• a type of artificial neuron
• a device that makes decisions by weighing up evidence

perceptron in a higher layer

• makes decisions at a more complex and more abstract level than perceptrons in earlier layers

many-layer network of perceptrons

• engage in sophisticated decision making

weight

• real numbers expressing the importance of the respective inputs to the output

bias

• a measure of how easy it is to get the perceptron to fire

Use perceptrons to compute any logical function

• a perceptron can implement a NAND gate, and the NAND gate is universal for computation
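A minimal sketch of the NAND claim, assuming a perceptron that fires when the weighted sum plus bias is positive. The weights −2, −2 and bias 3 are one choice that works; the helper names `perceptron`, `nand`, and `xor` are only illustrative.

```python
def perceptron(inputs, weights, bias):
    """Return 1 if the weighted sum plus bias is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def nand(x1, x2):
    # Weights -2, -2 and bias 3 are one choice that realizes NAND.
    return perceptron([x1, x2], weights=[-2, -2], bias=3)


def xor(x1, x2):
    # XOR built only from NAND gates, as a small demonstration of universality.
    n = nand(x1, x2)
    return nand(nand(x1, n), nand(x2, n))


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "NAND:", nand(a, b), "XOR:", xor(a, b))
```

Because any logical function can be built from NAND gates, a network of such perceptrons can in principle compute it too.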

### Sigmoid neurons

Learning

• DEF: changing the weights and biases over and over to produce better and better output

Problem with perceptrons

• the output is a step function of the weighted input
• so a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to flip completely, say from 0 to 1

sigmoid neuron

• Inputs can take any value between 0 and 1
• Output is also between 0 and 1, given by $σ(w \cdot x + b)$

$σ(z) ≡ \frac{1}{1 + e^{-z}}$ where $z = w \cdot x + b$

$σ'(z) = σ(z)(1 - σ(z))$

• The smoothness of σ means that small changes $Δw_j$ in the weights and $Δb$ in the bias will produce a small change $Δ\text{output}$ in the output from the neuron.

• $\Delta \text{output} \approx \sum_j \frac{\partial\, \text{output}}{\partial w_j} \Delta w_j + \frac{\partial\, \text{output}}{\partial b} \Delta b$
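The linear approximation for Δoutput can be checked numerically. The sketch below assumes a single sigmoid neuron with two inputs; the specific values of `w`, `x`, `b`, `dw`, and `db` are made up for illustration. It uses $∂\text{output}/∂w_j = σ'(z)\,x_j$ and $∂\text{output}/∂b = σ'(z)$, which follow from the chain rule.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)


def neuron_output(w, x, b):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Illustrative values (assumptions, not from the notes).
w, x, b = [0.6, -0.4], [0.5, 0.9], 0.1
dw, db = [0.001, -0.002], 0.0005

z = sum(wj * xj for wj, xj in zip(w, x)) + b
# First-order prediction: sum_j sigma'(z) * x_j * dw_j + sigma'(z) * db
predicted = (sum(sigmoid_prime(z) * xj * dwj for xj, dwj in zip(x, dw))
             + sigmoid_prime(z) * db)

# Actual change after applying the small perturbations.
w_new = [wj + dwj for wj, dwj in zip(w, dw)]
actual = neuron_output(w_new, x, b + db) - neuron_output(w, x, b)

print("predicted change:", predicted)
print("actual change:   ", actual)
```

For small perturbations the two numbers should agree closely; with a step-function perceptron no such linear prediction is available.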

### The architecture of neural networks

• input layer, output layer, hidden layer

Feedforward neural networks

• no loops in the network

Recurrent neural networks

• feedback loops are possible
• have neurons which fire for some limited duration of time, before becoming quiescent

### A simple network to classify handwritten digits

To recognize individual digits we will use a three-layer neural network

• Input: 28 × 28 pixels = 784 input neurons
• Output: 10 neurons, one for each digit 0 to 9
• Hidden layer: n = 15 neurons, as an example
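A rough sketch of how a 784-15-10 network turns an image into a digit guess, assuming sigmoid neurons in every layer and random (untrained) Gaussian weights. The function names `init_network` and `feedforward` and the all-black placeholder image are assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable form: avoids overflow in exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def init_network(sizes, seed=0):
    """Random Gaussian weights and biases for every layer after the input."""
    rng = random.Random(seed)
    biases = [[rng.gauss(0, 1) for _ in range(n)] for n in sizes[1:]]
    weights = [[[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
               for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    return weights, biases


def feedforward(weights, biases, a):
    """Propagate activations layer by layer: a' = sigma(W a + b)."""
    for W, b in zip(weights, biases):
        a = [sigmoid(sum(wj * aj for wj, aj in zip(row, a)) + bj)
             for row, bj in zip(W, b)]
    return a


sizes = [784, 15, 10]                 # input, hidden, output
weights, biases = init_network(sizes)
image = [0.0] * 784                   # placeholder all-black image
output = feedforward(weights, biases, image)
digit = output.index(max(output))     # index of the most active output neuron
print(len(output), digit)
```

The network's guess is the index of the output neuron with the highest activation; with untrained weights that guess is meaningless, which is what learning is meant to fix.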

Output layer: why use 10 numbers rather than 4 bits?

• with a 4-bit encoding, there’s no easy way to relate the most significant bit of a digit to simple shapes in the image

A way to think about the hidden layer

• weighing up evidence and producing an abstraction

### Learning with gradient descent

cost function

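• DEF: $C(w, b) ≡ \frac{1}{2n} \sum_x ∥y(x) − a∥^2$ (quadratic cost, summed over all training inputs x)
• $y(x)$ : desired output, e.g. $y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T$ for an image of a 6
• a : vector of outputs from the network when x is input
• x : input, n : number of training inputs, w : weights, b : biases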
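The cost definition above can be sketched directly in code. The helpers `one_hot` and `quadratic_cost` are illustrative names, and the "network outputs" here are hand-written vectors rather than the result of a real network.

```python
def one_hot(digit, size=10):
    """Desired output y(x): 1.0 at the digit's index, 0.0 elsewhere."""
    v = [0.0] * size
    v[digit] = 1.0
    return v


def quadratic_cost(outputs, targets):
    """C = 1/(2n) * sum over x of ||y(x) - a||^2."""
    n = len(outputs)
    total = 0.0
    for a, y in zip(outputs, targets):
        total += sum((yi - ai) ** 2 for yi, ai in zip(y, a))
    return total / (2 * n)


targets = [one_hot(6), one_hot(3)]
perfect = [one_hot(6), one_hot(3)]   # network output equals desired output
wrong = [one_hot(6), one_hot(8)]     # second image confidently misclassified

print(quadratic_cost(perfect, targets))
print(quadratic_cost(wrong, targets))
```

The cost is zero exactly when every output matches its desired output, and grows as the outputs move away from it.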

Aim of our training algorithm

• minimize the cost C (w, b) as a function of the weights and biases

Why not directly maximize the number of images correctly classified?

• the number of images correctly classified is not a smooth function of the weights and biases

Gradient Descent Algorithm

• repeatedly compute the gradient $∇C$, then move in the opposite direction

$ΔC ≈ \sum_k \frac{∂C}{∂w_k} Δw_k + \sum_l \frac{∂C}{∂b_l} Δb_l = ∇C ⋅ Δv$, where v collects all the weights and biases

choose Δv so as to make ΔC negative

$Δv = −η∇C$, where η > 0 is the learning rate, so that $ΔC ≈ −η∥∇C∥^2 ≤ 0$

apply to each component

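$w_k → w_k' = w_k − η \frac{∂C}{∂w_k}$

$b_l → b_l' = b_l − η \frac{∂C}{∂b_l}$

Problem

• In practical implementations, η is often varied so that the approximation $ΔC ≈ ∇C ⋅ Δv$ remains good; but if η is too small, the algorithm is too slow
• computing $∇C$ requires averaging $∇C_x$ over all n training inputs, which is slow when n is large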
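To see the update rule and the trade-off in η, here is a small sketch on a toy cost $C(v) = v_1^2 + v_2^2$, chosen only because its gradient is easy to write down. It is not the network cost from these notes, and the η values are arbitrary examples.

```python
def cost(v):
    return v[0] ** 2 + v[1] ** 2


def grad(v):
    # Gradient of the toy cost: (2 v1, 2 v2)
    return [2 * v[0], 2 * v[1]]


def gradient_descent(v, eta, steps):
    """Repeatedly apply v -> v - eta * grad C(v)."""
    for _ in range(steps):
        g = grad(v)
        v = [vi - eta * gi for vi, gi in zip(v, g)]
    return v


start = [3.0, -4.0]
for eta in (0.01, 0.1, 1.1):
    v = gradient_descent(start, eta, steps=50)
    print("eta =", eta, "cost after 50 steps =", cost(v))
```

A very small η makes slow progress, a moderate η converges quickly, and an η that is too large breaks the linear approximation so the cost grows instead of shrinking.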

Stochastic gradient descent

• used to speed up learning
• estimate the gradient $∇C$ by computing $∇C_x$ for a small sample (a mini-batch) of m randomly chosen training inputs $X_1, …, X_m$

$w_k → w_k' = w_k − \frac{η}{m} \sum_j \frac{∂C_{X_j}}{∂w_k}$

$b_l → b_l' = b_l − \frac{η}{m} \sum_j \frac{∂C_{X_j}}{∂b_l}$
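A minimal sketch of the mini-batch update, assuming a one-weight linear model $a = wx + b$ with per-example cost $C_x = \frac{1}{2}(y − a)^2$ instead of a full network, so the gradients can be written by hand: $∂C_x/∂w = (a − y)x$ and $∂C_x/∂b = (a − y)$. The function name `sgd`, the synthetic data, and the hyperparameters are all illustrative assumptions.

```python
import random


def sgd(data, eta, m, epochs, seed=0):
    """Mini-batch SGD for the model a = w*x + b with C_x = (y - a)^2 / 2."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)                      # pick mini-batches at random
        for start in range(0, len(data), m):
            batch = data[start:start + m]
            # Sum the per-example gradients over the mini-batch.
            grad_w = sum((w * x + b - y) * x for x, y in batch)
            grad_b = sum((w * x + b - y) for x, y in batch)
            # w -> w - (eta/m) * sum_j dC_Xj/dw, and likewise for b.
            w -= eta / len(batch) * grad_w
            b -= eta / len(batch) * grad_b
    return w, b


# Synthetic, noise-free data from y = 2x + 1 (an assumption for the demo).
data = [(i / 99, 2 * (i / 99) + 1) for i in range(100)]
w, b = sgd(data, eta=0.5, m=10, epochs=200)
print(round(w, 3), round(b, 3))
```

Each update looks at only m = 10 examples rather than all 100, yet the parameters should still approach w = 2 and b = 1 on this data.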

### Toward deep learning

The weights and biases in the network were discovered automatically.

That means we don’t immediately have an explanation of how the network does what it does.

A heuristic

• decompose the problem into sub-problems
• early layers answering very simple and specific questions about the input image
• later layers building up a hierarchy of ever more complex and abstract concepts 