[NN&DL-ch1] Using neural nets to recognize handwritten digits



perceptron

  • a type of artificial neuron
  • a device that makes decisions by weighing up evidence

perceptron in higher layer

  • make a decision at a more complex and more abstract level

many-layer network of perceptrons

  • engage in sophisticated decision making


weights

  • real numbers expressing the importance of the respective inputs to the output


bias

  • a measure of how easy it is to get the perceptron to fire

Use perceptrons to compute any logical function

  • NAND gate is universal for computation
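A minimal sketch of the NAND claim, assuming a two-input perceptron with weights −2, −2 and bias 3 (one choice that works; these values and the function names are illustrative, not fixed by the notes). XOR is then built from NAND gates alone to illustrate universality:

```python
def perceptron(inputs, weights, bias):
    """Step-activation perceptron: output 1 if w . x + b > 0, else 0."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z > 0 else 0

def nand(x1, x2):
    # Assumed parameters: weights (-2, -2), bias 3.
    return perceptron((x1, x2), (-2, -2), 3)

def xor(x1, x2):
    # XOR composed only of NAND gates (standard four-gate construction).
    n = nand(x1, x2)
    return nand(nand(x1, n), nand(x2, n))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "NAND:", nand(a, b), "XOR:", xor(a, b))
```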

Sigmoid neurons


learning

  • DEF: changing the weights and biases over and over to produce better and better output

Problems of Perceptron

  • step function
  • a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from 0 to 1
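The flip can be reproduced with a toy perceptron. This is only a sketch; the weights, inputs and the size of the change are hypothetical values chosen so that the weighted sum sits exactly on the threshold:

```python
def step_perceptron(weights, inputs, bias):
    """Perceptron with a step activation: 1 if w . x + b > 0, else 0."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z > 0 else 0

inputs = [0.5, 0.5]
# Weighted sum is exactly 0, so the perceptron does not fire.
before = step_perceptron([1.0, -1.0], inputs, 0.0)
# Changing one weight by 0.001 pushes the sum just above 0.
after = step_perceptron([1.001, -1.0], inputs, 0.0)
print(before, after)
```

A change of 0.001 in one weight moves the output all the way from 0 to 1, which is what makes gradual learning hard with step activations.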

sigmoid neuron

  • Inputs can take on any value between 0 and 1
  • Output is also between 0 and 1, given by $ σ( w ⋅ x + b ) $

$ σ(z) ≡ \frac { 1 }{ 1 + e ^ {-z } } $ where $ z = w ⋅ x + b $

$ σ'(z) = σ(z) (1 - σ(z)) $

  • The smoothness of σ means that small changes $Δw_j$ in the weights and $Δb$ in the bias will produce a small change $Δ\text{output}$ in the output from the neuron.

  • $ Δ\text{output} \approx \sum_j \frac {\partial\, \text{output}}{ \partial w_j} Δw_j + \frac {\partial\, \text{output}}{ \partial b} Δb $
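The linear approximation can be checked numerically. The sketch below assumes a single sigmoid neuron with hypothetical weights, inputs and perturbations, and uses $ σ'(z) $ to form the partial derivatives ($ ∂\text{output}/∂w_j = σ'(z)\,x_j $, $ ∂\text{output}/∂b = σ'(z) $):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def neuron(w, x, b):
    """Output of one sigmoid neuron: sigma(w . x + b)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical values, for illustration only.
w, x, b = [0.6, -0.4], [0.5, 0.9], 0.1
dw, db = [0.01, -0.02], 0.005

z = sum(wj * xj for wj, xj in zip(w, x)) + b
sp = sigmoid_prime(z)

# First-order prediction of the change in output.
predicted = sum(sp * xj * dwj for xj, dwj in zip(x, dw)) + sp * db
# Actual change after perturbing the weights and the bias.
actual = neuron([wj + dwj for wj, dwj in zip(w, dw)], x, b + db) - neuron(w, x, b)

print(predicted, actual)
```

For perturbations this small the two numbers should agree closely; the gap grows as the changes get larger.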

The architecture of neural networks

  • input layer, output layer, hidden layer

Feed Forward Neural Networks

  • no loops in the network

Recurrent neural networks

  • feedback loops are possible
  • have neurons which fire for some limited duration of time, before becoming quiescent

A simple network to classify handwritten digits

To recognize individual digits we will use a three-layer neural network

  • Input: 28 × 28 pixels = 784 neurons
  • Output: 10 neurons, one for each digit 0 to 9
  • Hidden layer: n = 15 as an example
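A sketch of a forward pass through such a 784-15-10 network, using only the standard library. The Gaussian initialization, the helper names and the random "image" are assumptions for illustration; nothing here is trained:

```python
import math
import random

def sigmoid(z):
    # Written in two branches to avoid overflow in exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def init_network(sizes, seed=0):
    """Random weights and biases for layer sizes such as [784, 15, 10]."""
    rng = random.Random(seed)
    biases = [[rng.gauss(0, 1) for _ in range(n)] for n in sizes[1:]]
    weights = [[[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
               for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    return weights, biases

def feedforward(weights, biases, a):
    """Propagate activations a through every layer: a' = sigma(W a + b)."""
    for W, b in zip(weights, biases):
        a = [sigmoid(sum(w * x for w, x in zip(row, a)) + bj)
             for row, bj in zip(W, b)]
    return a

weights, biases = init_network([784, 15, 10])
rng = random.Random(1)
image = [rng.random() for _ in range(784)]  # stand-in for a 28 x 28 image
output = feedforward(weights, biases, image)
digit = max(range(10), key=lambda i: output[i])  # most active output neuron
print(digit)
```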

Output layer: why use 10 numbers rather than 4 bits?

  • there’s no easy way to relate that most significant bit to simple shapes
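For comparison, the two output encodings can be written out side by side. A small sketch with hypothetical helper names:

```python
def one_hot(digit, num_classes=10):
    """10-neuron encoding: a single 1 at the index of the digit."""
    return [1.0 if i == digit else 0.0 for i in range(num_classes)]

def decode(activations):
    """Read off the digit as the index of the most active output neuron."""
    return max(range(len(activations)), key=lambda i: activations[i])

def four_bit(digit):
    """4-neuron alternative: binary digits, most significant bit first."""
    return [(digit >> k) & 1 for k in (3, 2, 1, 0)]

print(one_hot(6))
print(four_bit(6))
print(decode(one_hot(6)))
```

With the 10-neuron encoding each output answers "is this digit d?", while in the 4-bit encoding each output has to fire for an unrelated set of digits.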

?A way to think about Hidden layer

  • weighing up evidence and producing an abstraction

Learning with gradient descent

cost function

  • DEF: $ C( w,b ) ≡ \frac{1}{2n} \sum_x ∥y(x) − a ∥^2 $
  • $y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T$ : desired output, here for an input image depicting a 6
  • a : output from the network when x is the input
  • x : training input, n : total number of training inputs, w : weights, b : biases
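The definition translates directly into code. A minimal sketch, where the function name and the example outputs are made up for illustration:

```python
def quadratic_cost(desired, actual):
    """C = 1/(2n) * sum over training inputs of ||y(x) - a||^2."""
    n = len(desired)
    total = 0.0
    for y, a in zip(desired, actual):
        total += sum((yi - ai) ** 2 for yi, ai in zip(y, a))
    return total / (2 * n)

# Desired output for an image of a 6.
y6 = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

# Hypothetical network output: right neuron, but only half activated.
half = [0.0] * 10
half[6] = 0.5

print(quadratic_cost([y6], [y6]))    # cost when the output matches exactly
print(quadratic_cost([y6], [half]))  # cost for the half-activated output
```

The cost is zero only when every output matches the desired vector, and it shrinks smoothly as the output approaches it.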

Aim of our training algorithm

  • minimize the cost C (w, b) as a function of the weights and biases

Why not directly maximize the number of images correctly classified?

  • the number of images correctly classified is not a smooth function of the weights and biases

Gradient Descent Algorithm

  • repeatedly compute the gradient ∇C, then move a small step in the opposite direction

$ ΔC ≈ \sum_k \frac { ∂ C }{∂ w_k} Δ w_k + \sum_l \frac{ ∂ C } {∂ b_l} Δ b_l = ∇C ⋅ Δv $, where $v$ collects all the weights $w_k$ and biases $b_l$

choose Δv so as to make ΔC negative

$Δv = −η∇C $
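Substituting this choice back into the approximation shows why the cost goes down (a one-line check, assuming η > 0 is small enough for the linear approximation to hold):

$ ΔC ≈ ∇C ⋅ Δv = −η\, ∇C ⋅ ∇C = −η ∥∇C∥^2 ≤ 0 $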

apply to each component

$ w_k → w_k' = w_k − η \frac{∂ C}{∂w_k }$

$ b_l → b_l' = b_l − η \frac{∂ C}{∂b_l }$


learning rate η

  • In practical implementations, η is often varied so that the approximation $ΔC ≈ ∇C ⋅ Δv$ remains good, but the algorithm isn't too slow
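The update rule can be tried on a toy cost. This sketch assumes $ C(v) = v_1^2 + v_2^2 $, whose gradient is known in closed form; the starting point, η and the number of steps are arbitrary choices:

```python
def cost(v):
    """Toy cost C(v) = v1^2 + v2^2, minimized at the origin."""
    return sum(vi * vi for vi in v)

def grad_cost(v):
    """Gradient of the toy cost: dC/dv_i = 2 * v_i."""
    return [2.0 * vi for vi in v]

def gradient_descent(v, eta, steps):
    """Repeatedly apply v -> v - eta * grad C(v)."""
    for _ in range(steps):
        g = grad_cost(v)
        v = [vi - eta * gi for vi, gi in zip(v, g)]
    return v

v0 = [3.0, -4.0]
v = gradient_descent(v0, eta=0.1, steps=50)
print(cost(v0), cost(v))
```

With η = 0.1 each step shrinks v by a factor of 0.8. For this particular cost, any η above 1 would make the iterates grow instead of shrink, which is the trade-off behind tuning η.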

Stochastic gradient descent

  • used to speed up learning
  • estimate the gradient $ ∇C $ by computing $ ∇C_x $ for a small sample of randomly chosen training inputs

$ w_k → w_k' = w_k − \frac{η}{m} \sum_j \frac{∂ C_{X_j}}{∂w_k }$

$ b_l → b_l' = b_l − \frac{η}{m} \sum_j \frac{∂ C_{X_j}}{∂b_l }$

  • the sums run over the m training inputs $X_j$ in the current mini-batch
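A sketch of mini-batch updates on a deliberately tiny problem: a single linear neuron a = w x + b fitted to noise-free data from y = 2x + 1, with per-example cost $ C_x = (y − a)^2 / 2 $. The model, data and hyperparameters are assumptions for illustration, not the digit network:

```python
import random

def sgd(data, eta, batch_size, epochs, seed=0):
    """Stochastic gradient descent for a = w*x + b with C_x = (y - a)^2 / 2."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # new random mini-batches each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            m = len(batch)
            # Sum of per-example gradients over the mini-batch:
            # dC_x/dw = (a - y) * x,  dC_x/db = (a - y)
            grad_w = sum((w * x + b - y) * x for x, y in batch)
            grad_b = sum((w * x + b - y) for x, y in batch)
            w -= eta / m * grad_w
            b -= eta / m * grad_b
    return w, b

# Synthetic training set: 100 points on the line y = 2x + 1.
data = [(i / 100.0, 2.0 * (i / 100.0) + 1.0) for i in range(100)]
w, b = sgd(data, eta=0.5, batch_size=10, epochs=200)
print(w, b)
```

Each update uses only 10 of the 100 examples, so one pass over the data gives ten parameter updates instead of one.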

Toward deep learning

The weights and biases in the network were discovered automatically.

And that means we don’t immediately have an explanation of how the network does what it does

A heuristic

  • decompose the problem into sub-problems
  • early layers answering very simple and specific questions about the input image
  • later layers building up a hierarchy of ever more complex and abstract concepts