ch1 Using neural nets to recognize handwritten digits
Perceptrons
 a type of artificial neuron
 a device that makes decisions by weighing up evidence
perceptron in higher layer
 make a decision at a more complex and more abstract level
many-layer network of perceptrons
 engage in sophisticated decision making
weight
 real numbers expressing the importance of the respective inputs to the output
bias
 a measure of how easy it is to get the perceptron to fire
Use perceptrons to compute any logical function
 NAND gate is universal for computation
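The perceptron rule and the NAND claim above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names `perceptron` and `nand` are hypothetical, and the weights (-2, -2) with bias 3 are just one choice that happens to produce NAND.

```python
def perceptron(inputs, weights, bias):
    """Return 1 if the weighted sum plus the bias is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def nand(x1, x2):
    # Assumed parameters: weights of -2 on each input and a bias of 3.
    return perceptron((x1, x2), (-2, -2), 3)


# Truth table over all four binary input pairs.
nand_table = {(a, b): nand(a, b) for a in (0, 1) for b in (0, 1)}
```

Only the input (1, 1) gives a negative weighted sum (-2 - 2 + 3 = -1), so only that input outputs 0.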
Sigmoid neurons
Learning
 DEF: changing the weights and biases over and over to produce better and better output
Problems of Perceptron

step function

a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from 0 to 1
sigmoid neuron

Inputs can take any value between 0 and 1, not just 0 or 1

Output is also between 0 and 1: $ σ(w ⋅ x + b) $
$ σ(z) ≡ \frac{1}{1 + e^{-z}} $ where $ z = w ⋅ x + b $
$ σ’(z) = σ(z) (1 − σ(z)) $
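As a quick check of the derivative identity, here is a small sketch assuming the standard logistic definition $σ(z) = 1/(1 + e^{-z})$. The helper `numeric_derivative` is a hypothetical name for a central-difference estimate, used only for comparison.

```python
import math


def sigmoid(z):
    """Logistic function: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Closed-form derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def numeric_derivative(f, z, h=1e-6):
    """Central-difference estimate of f'(z), used only as a sanity check."""
    return (f(z + h) - f(z - h)) / (2 * h)
```

The two derivatives should agree to several decimal places at any moderate value of z.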

The smoothness of σ means that small changes $Δw_j$ in the weights and $Δb$ in the bias will produce a small change $Δoutput$ in the output from the neuron.

$ \Delta \text{output} \approx \sum_j \frac{\partial \, \text{output}}{\partial w_j} \Delta w_j + \frac{\partial \, \text{output}}{\partial b} \Delta b $
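The linear approximation can be tested numerically on a single sigmoid neuron. In this sketch the weights, inputs and perturbations are made-up illustrative numbers, and the partial derivatives used are $∂output/∂w_j = σ’(z) x_j$ and $∂output/∂b = σ’(z)$, which follow from the chain rule.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(w, x, b):
    """Output of one sigmoid neuron with weights w, inputs x and bias b."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return sigmoid(z)


def predicted_change(w, x, b, dw, db):
    """First-order estimate of the output change for small dw and db."""
    s = neuron_output(w, x, b)
    sp = s * (1.0 - s)  # sigma'(z)
    return sum(sp * xj * dwj for xj, dwj in zip(x, dw)) + sp * db


# Illustrative values only (assumptions, not taken from the notes).
w, x, b = [0.6, -0.4], [0.5, 0.9], 0.1
dw, db = [0.001, -0.002], 0.0005

new_w = [wj + dwj for wj, dwj in zip(w, dw)]
actual = neuron_output(new_w, x, b + db) - neuron_output(w, x, b)
predicted = predicted_change(w, x, b, dw, db)
```

For perturbations this small, `actual` and `predicted` differ only by second-order terms.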
The architecture of neural networks
 input layer, output layer, hidden layer
Feed Forward Neural Networks
 no loops in the network
Recurrent neural networks
 feedback loops are possible
 have neurons which fire for some limited duration of time, before becoming quiescent
A simple network to classify handwritten digits
To recognize individual digits we will use a three-layer neural network
 Input: 28 by 28 pixel = 784
 Output: 10 neurons, one for each digit 0 to 9
 Hidden layer: n = 15 as an example
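A forward pass through a 784-15-10 network might look like the sketch below. It uses only the standard library. The random Gaussian initialization and the random "image" are stand-in assumptions, and `make_layer` and `feedforward` are hypothetical helper names. The predicted digit is taken to be the index of the most active output neuron.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values) and n_out biases."""
    weights = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.gauss(0, 1) for _ in range(n_out)]
    return weights, biases


def feedforward(layers, a):
    """Propagate activations a through each (weights, biases) layer in turn."""
    for weights, biases in layers:
        a = [sigmoid(sum(w * x for w, x in zip(row, a)) + b)
             for row, b in zip(weights, biases)]
    return a


rng = random.Random(0)
sizes = [784, 15, 10]  # input, hidden, output
layers = [make_layer(n_in, n_out, rng)
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

image = [rng.random() for _ in range(784)]  # stand-in for 28x28 pixel intensities
output = feedforward(layers, image)
predicted_digit = max(range(10), key=lambda i: output[i])
```

With untrained random weights the prediction is meaningless; the point is only the shape of the computation.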
Output layer: why use 10 numbers rather than 4 bits?
 there’s no easy way to relate that most significant bit to simple shapes
A way to think about the hidden layer
 weighing up evidence to produce an abstraction
Learning with gradient descent
cost function
 DEF: $ C(w, b) ≡ \frac{1}{2n} \sum_x ∥y(x) − a∥^2 $
 y(x) : desired output, e.g. $y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T$ for an image of a 6
 a : output from the network when x is the input
 x : training input, n : number of training inputs, w : weights, b : biases
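The quadratic cost can be computed directly from its definition. This sketch assumes one-hot target vectors as described above; `one_hot` and `quadratic_cost` are hypothetical helper names, and the example outputs are made up.

```python
def one_hot(digit, size=10):
    """Desired output vector y(x): 1.0 at the digit's index, 0.0 elsewhere."""
    return [1.0 if i == digit else 0.0 for i in range(size)]


def quadratic_cost(outputs, targets):
    """C = 1/(2n) * sum over inputs of ||y(x) - a||^2."""
    n = len(outputs)
    total = sum(
        sum((y - a) ** 2 for y, a in zip(target, output))
        for output, target in zip(outputs, targets)
    )
    return total / (2 * n)


targets = [one_hot(6), one_hot(2)]

# A network that reproduces the targets exactly has zero cost.
perfect_cost = quadratic_cost(targets, targets)

# One all-zero output (squared error 1) and one perfect output (squared error 0).
outputs = [[0.0] * 10, one_hot(2)]
cost = quadratic_cost(outputs, targets)  # (1 + 0) / (2 * 2) = 0.25
```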
Aim of our training algorithm
 minimize the cost C (w, b) as a function of the weights and biases
Why not directly maximize the number of images correctly classified?
 the number of images correctly classified is not a smooth function of the weights and biases
Gradient Descent Algorithm
 repeatedly compute the gradient ∇C, then move in the opposite direction
$ ΔC ≈ \sum_k \frac{∂ C}{∂ w_k} Δ w_k + \sum_l \frac{∂ C}{∂ b_l} Δ b_l = ∇C ⋅ Δv $, where $v$ collects all the weights and biases
choose Δv so as to make ΔC negative
$Δv = −η∇C $
apply to each component
$ w_k → w_k’ = w_k − η \frac{∂ C}{∂w_k }$
$ b_l → b_l’ = b_l − η \frac{∂ C}{∂b_l }$
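The update rule $v → v − η∇C$ can be tried on a toy cost before applying it to a network. This sketch assumes the simple bowl-shaped cost $C(v) = v_1^2 + v_2^2$, whose gradient is $2v$. The starting point and the learning rate η = 0.1 are arbitrary choices.

```python
def cost(v):
    """Toy cost C(v) = sum of squares, minimized at the origin."""
    return sum(x * x for x in v)


def grad(v):
    """Gradient of the toy cost: dC/dv_i = 2 * v_i."""
    return [2 * x for x in v]


def gradient_descent(v, eta, steps):
    """Repeatedly apply v -> v - eta * grad C, recording the cost each step."""
    history = [cost(v)]
    for _ in range(steps):
        g = grad(v)
        v = [x - eta * gx for x, gx in zip(v, g)]
        history.append(cost(v))
    return v, history


v_final, history = gradient_descent([3.0, -4.0], eta=0.1, steps=50)
```

Each step scales v by (1 − 2η) = 0.8 here, so the cost falls monotonically towards zero. With a much larger η the same loop would overshoot and diverge.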
Problem
 η must be small enough that $ΔC ≈ ∇C ⋅ Δv$ remains a good approximation, but not so small that the algorithm becomes too slow; in practical implementations η is often varied to balance the two
Stochastic gradient descent
 used to speed up learning
 estimate the gradient $ ∇C $ by computing $ ∇C_x $ for a small sample of randomly chosen training inputs
$ w_k → w_k’ = w_k − \frac{η}{m} \sum_j \frac{∂ C_{X_j}}{∂ w_k} $
$ b_l → b_l’ = b_l − \frac{η}{m} \sum_j \frac{∂ C_{X_j}}{∂ b_l} $
 where $X_1, …, X_m$ are the $m$ randomly chosen training inputs in the mini-batch
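The mini-batch update can be sketched on a much smaller problem than digit recognition. The example below is an assumption made for illustration: it fits a single linear neuron $a = wx + b$ to noiseless data generated from $y = 2x + 1$, with per-example cost $C_x = \frac{1}{2}(a − y)^2$. The function name `sgd` and all hyperparameters are hypothetical.

```python
import random


def sgd(data, eta, m, epochs, rng):
    """Mini-batch SGD for a one-input linear neuron a = w*x + b."""
    w, b = 0.0, 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)  # draw fresh random mini-batches each epoch
        for start in range(0, len(data), m):
            batch = data[start:start + m]
            # Sum the per-example gradients of C_x = 0.5 * (w*x + b - y)^2.
            grad_w = sum((w * x + b - y) * x for x, y in batch)
            grad_b = sum((w * x + b - y) for x, y in batch)
            # Average over the mini-batch and step against the gradient.
            w -= eta / len(batch) * grad_w
            b -= eta / len(batch) * grad_b
    return w, b


rng = random.Random(0)
data = [(i / 100, 2 * (i / 100) + 1) for i in range(100)]
w, b = sgd(data, eta=0.5, m=10, epochs=100, rng=rng)
```

Each update touches only 10 of the 100 training inputs, yet w and b still approach 2 and 1, because the mini-batch average is an estimate of the full gradient.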
Toward deep learning
The weights and biases in the network were discovered automatically.
And that means we don’t immediately have an explanation of how the network does what it does
A heuristic
 we could use is to decompose the problem into subproblems
 early layers answering very simple and specific questions about the input image
 later layers building up a hierarchy of ever more complex and abstract concepts