Ch3. Improving the way neural networks learn
The techniques we’ll develop in this chapter include
 a better choice of cost function, the cross-entropy cost function;
 four so-called “regularization” methods (L1 and L2 regularization, dropout, and artificial expansion of the training data), which make our networks better at generalizing beyond the training data;
 a better method for initializing the weights in the network;
 a set of heuristics to help choose good hyper-parameters.
The cross-entropy cost function
Problem with the quadratic cost function
An artificial neuron has a lot of difficulty learning when it’s badly wrong
w = b = 2.0, η = 0.15
Quadratic cost function $ C = \frac{ (y-a)^2 }{ 2 } $
 when the neuron’s output is close to 1 ( but far from the desired output )
 the curve gets very flat, so $ σ′(z) $ gets very small
 so $ ∂C/∂w $ and $ ∂C/∂b $ also get very small, and learning is slow
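A minimal numerical sketch of this slowdown (the single-neuron setup follows the chapter's toy example; the function names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Single neuron, input x = 1, desired output y = 0.
# Quadratic cost C = (y - a)^2 / 2 gives dC/dw = (a - y) * sigmoid'(z) * x,
# so the gradient carries a sigmoid'(z) factor.
def quadratic_grad_w(w, b, x=1.0, y=0.0):
    z = w * x + b
    a = sigmoid(z)
    return (a - y) * sigmoid_prime(z) * x

print(quadratic_grad_w(0.6, 0.9))  # moderately wrong: larger gradient
print(quadratic_grad_w(2.0, 2.0))  # badly wrong but saturated: tiny gradient
```

With w = b = 2.0 the output is near 1 while the target is 0, yet the gradient is close to zero because σ′(z) is tiny: the neuron learns slowest exactly when it is most wrong.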
Introducing the cross-entropy
$ C = -\frac{1}{n} \sum_x [\, y \ln a + (1-y) \ln (1-a) \,] $
where $ a = σ(z) $ and $ z = \sum_j w_j x_j + b $
 It’s a valid cost function :
 nonnegative
 close to zero when a is close to y
 avoids the problem of learning slowdown
 the rate of learning is controlled by the error $ a - y $, not by $ σ′(z) $
Using the quadratic cost when we have linear neurons in the output layer
 all the neurons in the final layer are linear neurons
 outputs are simply $ a_j^L = z_j^L $, the sigmoid is not applied
 the quadratic cost will not cause a learning slowdown
Softmax
 define a new type of output layer
 apply the so-called softmax function to the $ z_j^L $ : $ a_j^L = \frac{ e^{z_j^L} }{ \sum_k e^{z_k^L} } $
the output is a set of positive numbers which sum to 1, and can be thought of as a probability distribution
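A small softmax sketch; the max-subtraction is a standard numerical trick, not something the notes prescribe:

```python
import numpy as np

# Softmax: exponentiate, then normalize.  Subtracting max(z) leaves the
# result unchanged but avoids overflow in exp for large z.
def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

a = softmax(np.array([3.0, 1.0, 0.2]))
print(a)        # all entries positive, largest z gets largest probability
print(a.sum())  # sums to 1: a probability distribution
```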
The learning slowdown problem
loglikelihood cost function
probability of $ a_y^L $ close to 1, cost close to 0
$ δ_j ^L = \partial C / \partial z_j^L $
Overfitting
 cost on the training data continues to decrease while
 classification accuracy on the test data stops improving
 cost on the test data starts to increase while
 classification accuracy on the training data rises up to 100%
It’s almost as though our network is merely memorizing the training set, without understanding digits well enough to generalize to the test set.
Detect overfitting
 keep track of accuracy on the test data as the network trains
 better, use validation data: if we tune on the test_data we may find hyper-parameters which fit particular peculiarities of the test_data
 early stopping: stop training when accuracy is no longer improving
 hold out method: the validation_data is kept apart or “held out” from the training_data
Avoid overfitting
 one of the best ways: increase the size of the training data
 regularization
Regularization
L2 regularization ( weight decay )
add an extra term, $ \frac{λ}{2n} \sum_w w^2 $, to the cost function
 Intuitively, the effect of regularization is to make the network prefer to learn small weights
 λ : a compromise between finding small weights and minimizing the original cost function
 $ \frac{ηλ}{n} w $ : weights shrink by an amount proportional to w on each update
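The L2-regularized SGD update can be sketched as follows (variable names are mine):

```python
import numpy as np

# L2-regularized update:  w -> (1 - eta*lambda/n) * w - eta * grad.
# The (1 - eta*lambda/n) factor shrinks each weight by an amount
# proportional to w before the usual gradient step -- hence "weight decay".
def l2_update(w, grad, eta, lmbda, n):
    return (1.0 - eta * lmbda / n) * w - eta * grad

w = np.array([1.0, -2.0])
w_decayed = l2_update(w, grad=np.zeros(2), eta=0.5, lmbda=5.0, n=50000)
# with a zero gradient, the weights simply decay slightly toward 0
```

Note the biases are not regularized, only the weights.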
Why does regularization help reduce overfitting?
the 9th-order polynomial model is really just learning the effects of local noise
 The smallness of the weights means that the network won’t change too much if we change a few random inputs here and there
 having a large bias doesn’t make a neuron sensitive to its inputs in the same way as having large weights
We don’t have an entirely satisfactory systematic understanding of what’s going on, merely incomplete heuristics and rules of thumb.
L1 regularization
 $ \frac{ηλ}{n} \, sgn(w) $ : weights shrink by a constant amount toward 0
 compared with L2: when $ |w| $ is large, L1 shrinks the weight less than L2; when $ |w| $ is small, L1 shrinks it more
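The L1 update, sketched alongside for contrast (names are mine):

```python
import numpy as np

# L1-regularized update:  w -> w - (eta*lambda/n) * sgn(w) - eta * grad.
# The shrink is a constant amount toward 0, independent of |w|, so large
# weights shrink proportionally less than under L2, and small ones more.
def l1_update(w, grad, eta, lmbda, n):
    return w - (eta * lmbda / n) * np.sign(w) - eta * grad

w = np.array([10.0, 0.5])
w_shrunk = l1_update(w, grad=np.zeros(2), eta=1.0, lmbda=100.0, n=1000)
# both weights moved toward 0 by the same constant amount, 0.1
```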
Dropout
modify the network itself rather than the cost function
 start by randomly (and temporarily) deleting half the hidden neurons
 forward-propagate the input and backpropagate the result through the modified network, then update the weights and biases
 restore the dropout neurons, choose a new random subset to delete, and repeat
 Finally, run the full network, halving the weights outgoing from the hidden neurons (to compensate for twice as many being active)
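A training-time dropout mask can be sketched like this (the function is mine; the 1/(1−p) variant mentioned in the comment is the common "inverted dropout" alternative, not something the notes describe):

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero each hidden activation with probability p = 0.5 during training.
# At test time the full network runs with the outgoing hidden weights
# halved (many modern implementations instead scale by 1/(1-p) during
# training, so no test-time change is needed).
def dropout(activations, p=0.5):
    mask = rng.random(activations.shape) >= p  # keep each neuron with prob 1-p
    return activations * mask

h = np.ones(10)
print(dropout(h))  # roughly half the activations zeroed out
```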
Heuristic
 averages the effects of a large number of different networks
 a neuron cannot rely on the presence of particular other neurons, so the network becomes robust to losing any individual connection
Artificially expanding the training data
The general principle is to expand the training data by applying operations that reflect real-world variation.
An aside on big data and what it means to compare classification accuracies
more training data can sometimes compensate for differences in the machine learning algorithm used
Weight initialization
Up to now
 choose w & b using independent Gaussian random variables
 mean 0 and standard deviation 1
 then z is a sum over a total of 501 normalized Gaussian random variables ( 500 active weights plus 1 bias )
 so z has a very broad Gaussian distribution ( standard deviation $ \sqrt{501} ≈ 22.4 $ ), and the neuron saturates
An easy way
 initialize with a more sharply peaked distribution
 mean 0 and standard deviation $ 1 / \sqrt {n_{in}} $ for the weights ( the bias can stay $ N(0,1) $ )
 this seems only to speed up learning, not to change the final performance
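The difference in the spread of z can be checked numerically (a sketch of the chapter's 1000-input, 500-active setup; the sampling loop is mine):

```python
import numpy as np

rng = np.random.default_rng(42)
n_in = 1000
x = np.zeros(n_in)
x[:500] = 1.0   # 500 inputs on, 500 off

# Compare the spread of z = w.x + b under the two initializations.
z_old = [rng.normal(0.0, 1.0, n_in) @ x + rng.normal() for _ in range(2000)]
z_new = [rng.normal(0.0, 1.0 / np.sqrt(n_in), n_in) @ x + rng.normal()
         for _ in range(2000)]

print(np.std(z_old))  # ~ sqrt(501) = 22.4: very broad, the neuron saturates
print(np.std(z_new))  # ~ sqrt(1.5) = 1.2: sharply peaked
```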
Connection with Regularization
 L2 regularization sometimes automatically gives us something similar to the new approach to weight initialization
Handwriting recognition revisited: the code
How to choose a neural network’s hyperparameters?
Problem when choosing η=10.0 and λ=1000.0
 classification accuracies are no better than chance
Broad strategy
 first aim for any signal at all: tweak until the network achieves results better than chance
Speed up experimentation
 stripping network down to the simplest
 increasing the frequency of monitoring
When having a signal
 gradually decrease the frequency of monitoring
 experiment with a more complex architecture, adjusting η and λ again
Learning Rate
 First estimate the threshold value of η above which the training cost starts oscillating or increasing
 η controls the step size in gradient descent, so monitor it via the training cost, not the validation accuracy
Number of training Epochs
Early stopping
 terminate when classification accuracy on the validation data stops improving
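A "no-improvement-in-n" rule can be sketched like this (the loop shape and names are mine):

```python
# Early stopping: stop once validation accuracy hasn't improved for
# `patience` consecutive epochs.
def train_with_early_stopping(accuracies, patience=10):
    """accuracies: iterable of per-epoch validation accuracies."""
    best, best_epoch = float("-inf"), 0
    for epoch, acc in enumerate(accuracies):
        if acc > best:
            best, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best, best_epoch

# accuracy plateaus after epoch 2; training stops early instead of running on
best, when = train_with_early_stopping([0.90, 0.94, 0.96] + [0.95] * 50,
                                       patience=10)
```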
Learning rate schedule
 use a large learning rate early on, when the weights are likely badly wrong
 later reduce it as we make more fine-tuned adjustments
The regularization parameter, λ
 start with λ = 0.0 to determine η
 then increase or decrease λ by factors of 10, and fine-tune
 return and re-optimize η again
Minibatch size
 it’s possible to use matrix techniques to compute the gradient update for all examples in a mini-batch simultaneously
 so a somewhat larger mini-batch can speed things up; too large, though, and the weights are updated too infrequently
Other techniques
Variations on stochastic gradient descent
Hessian technique
incorporates not just the gradient, but also information about how the gradient is changing
$ H $ is a matrix known as the Hessian matrix, whose $ jk $th entry is $ ∂^2C/∂w_j∂w_k $
 the sheer size of the Hessian matrix makes it difficult to compute
Momentum-based gradient descent
introduces a notion of “velocity” and “friction”
 replace the gradient descent update rule $ w→w′=w−η∇C $ by
 $ v→v′=μv−η∇C $
 $ w→w′=w+v′ $
 the “force” ∇C now modifies the velocity v, and the velocity controls the rate of change of w
 1−μ can be thought of as the amount of friction in the system
 μ=1 : no friction; μ=0 : a lot of friction, the velocity can’t build up, and we recover ordinary gradient descent
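A one-dimensional sketch of the momentum rule on the cost C(w) = w²/2 (so ∇C = w; the function is mine):

```python
# Momentum update:  v -> mu*v - eta*grad ;  w -> w + v.
def momentum_step(w, v, grad, eta=0.1, mu=0.9):
    v = mu * v - eta * grad   # the "force" grad modifies the velocity...
    return w + v, v           # ...and the velocity moves the weight

w, v = 5.0, 0.0
for _ in range(3):
    w, v = momentum_step(w, v, grad=w)   # gradient of C(w) = w^2 / 2 is w
print(w)  # the velocity builds up, so the steps grow: 4.5, 3.6, 2.43
```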
Other models of artificial neuron
tanh neuron
 ranges from −1 to 1, not 0 to 1
 allows both positive and negative activations
 the activations in hidden layers would be equally balanced
Rectified linear unit
 increasing the weighted input never causes the unit to saturate, so there is no corresponding learning slowdown ( though for z < 0 the gradient vanishes and the unit stops learning on that input )
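The rectifier and its gradient make the no-saturation point explicit (a sketch; function names are mine):

```python
import numpy as np

# ReLU and its gradient: for z > 0 the gradient is exactly 1, however large
# z gets, so the unit never saturates the way a sigmoid does; for z < 0 the
# gradient is 0, and the unit simply stops learning on that input.
def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    return (z > 0).astype(float)

print(relu_prime(np.array([-2.0, 0.5, 100.0])))  # [0. 1. 1.]
```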