Implementing Minibatch Gradient Descent for Neural Networks
It’s a public knowledge that Python is the de facto language of Machine Learning. It’s not without reason: Python has a very healthy and active libraries that are very useful for numerical computing. Libraries like Numpy, SciPy, Pandas, Matplotlib, etc are very important building blocks for implementing Machine Learning algorithms.
So, let’s back to basic and learn how to create a simple 3-layers Neural Networks, and implement Minibatch Gradient Descent using Numpy.
Defining the Networks
First, the dataset. We’re going to create toy dataset by using Scikit-Learn
To get clearer picture about the dataset, here’s the visualization:
Next, we define our NN model. It will be a three layers networks (single hidden layer):
We will also define two operations: forward propagation and the backpropagation. Let’s digest them one by one. We’re going to define the forward propagation first.
The basic scheme is to do series of dot product from input to hidden layer to output layer. In the hidden layer, we also apply some kind of non-linearity so that our NN could predict non-linear decision boundary. The hottest non-linearity function right now is ReLU, and we’re going to follow that trend.
Now, ReLU is defined by
f(x) = max(0, x). But, rather than doing
np.max(0, x), there’s a neat implementation trick:
x[x < 0] = 0. It’s subjectively more readable.
Once we reach the output layer, we need to make the output to be a probability distribution, Bernoulli to be exact, so we squash the output with Softmax function to get our label distribution.
For the backprop, we start from the top (output layer) down to the input layer. We start by propagating the error to get the gradient of weight between hidden layer and output layer, then we compute the gradient of hidden layer using the error gradient. Using hidden layer gradient, we then compute the gradient of weight between input and hidden layer.
We could use the model already, but it’s meaningless as it will yield random result (because of random initialization). So, what we need to do is to train the networks with our dataset.
The most popular optimization algorithm in Machine Learning is Gradient Descent. It’s a first order method, meaning that we only need to compute the gradient of the objective function to use it. Compared to the second order method like Netwon’s Method or BFGS, where we have to derive the partial second order derivative matrix (Hessian), Gradient Descent is simpler, albeit we have more to fiddle about.
Now, there are three variants of Gradient Descent: Batch, Stochastic, and Minibatch:
- Batch will use full training data at each iteration, with could be very expensive if our dataset is large and our networks have a lot of parameters.
- Stochastic uses only single data point to propagate the error, which would make the convergence slow, as the variance is big (because Law of Large Numbers doesn’t apply).
- Minibatch is combining the best of both worlds. We don’t use full dataset but we also don’t use single data point either. For example, we’re using 50, 100, or 200 random subset of our dataset each time we train the networks. This way, we lower the computation cost, and yet we’re still get lower variance than by using the Stochastic version.
Implementing the Minibatch Gradient Descent
Let’s apply our algorithm of choice Minibatch Gradient Descent:
This is the skeleton of our training algorithm. Notice how we first shuffle the training data first then get chunks from it. That way we will get the random subset with size of
minibatch_size of our full training data. Each of those random subset will be fed to the networks, and then the gradients of that minibatch will be propagate back to update the parameters/weights of the networks.
sgd_step function is like this:
First, we get the gradients of each layer of the networks for the current minibatch. Then we use that gradients to update the weights of each layers of the networks. The update operation is very easy. We just add the gradient of particular weight matrix to our existing weight matrix.
To make the learning better, we scale the gradients with a learning rate hyperparameter. If we think visually, the gradient of a function is the direction of our step, whereas the learning rate is how far we take the step.
Lastly, we look at our
What we do here is to iterate every data point in our minibatch, then feed it to the network and compare the output with the true label from the training data. The error is neatly defined by the difference of the probability of true label with the probability of our prediction. This is so, because we’re implicitly using the Softmax output with Cross Entropy cost function. See here for full derivation: http://math.stackexchange.com/questions/945871/derivative-of-softmax-loss-function
Because this is a Minibatch Gradient Descent algorithm, we then accumulate all the informations of the current minibatch. We use all those informations of the current minibatch to do the backprop, which will yield gradients of the networks’ parameters (
Let’s test our model and algorithm. We’re going to repeat the train-test procedure for 100 times then report the average of the accuracy.
With 10 iterations and minibatch size of 50, here’s what we get:
Mean accuracy: 0.8765040000000001, std: 0.0048433442991387705
So, we get about 87% accuracy on average. Not bad at all!
Conclusion and Future Work
Here, we showed how to define a simple Neural Networks model from scratch using Numpy. We also defined the forward and backpropagation operations of the networks. We then wrote an algorithm to train the network: Minibatch Gradient Descent. Finally we tested the trained networks using test dataset.
This model obviously a very simple Neural Networks model. If you noticed, it doesn’t even have bias! Nevertheless it served our purpose: implementing the Minibatch Gradient Descent algorithm.
Think of this post and codes as the foundation of the next posts about the more sophisticated and better optimization algorithms.