Wanna Make AI Series #4

Hello all, welcome to the 4th part of the "Wanna Make AI Series". Today we continue from where the 3rd part left off. Before we begin, make sure to check out the first part and the second part of this series, as the information in those posts builds your knowledge step by step.

Also, you can get all the code for this series on our GitHub.

So, let's begin and continue our journey.

Small Recap

Before we recap, a quick note: the four techniques we are discussing are more generally called Optimization Methods, because we are essentially optimizing (minimizing) the error. Also, the learning rate (α) determines the size of the steps taken to reach the minimum. The image below shows its importance, and a small sketch of the update step follows the figure.
Effects of learning rate on Loss (Source)
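
As a refresher of the update rule itself, here is a minimal sketch of a single gradient-descent step in Python/NumPy on a toy squared-loss linear model (the function name and variables are illustrative, not taken from the series' GitHub repo), showing exactly where α enters:

```python
import numpy as np

# A single gradient-descent step on a toy squared-loss linear model
# y_pred = w * x + b. Names here are illustrative only.
def gradient_step(w, b, x, y, alpha=0.01):
    y_pred = w * x + b                      # model prediction
    grad_w = np.mean(2 * (y_pred - y) * x)  # d(loss)/dw for squared loss
    grad_b = np.mean(2 * (y_pred - y))      # d(loss)/db for squared loss
    # alpha (the learning rate) scales how far we step against the gradient:
    # too small and training crawls, too large and the loss can diverge.
    return w - alpha * grad_w, b - alpha * grad_b
```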

So, coming to the recap: in the third part we covered a basic understanding of the loss function and the squared loss. We also discussed the iterative approach and gradient descent, which are used to reduce the loss while training an AI model. In this part we will discuss Stochastic Gradient Descent as well as Mini-Batch Gradient Descent.

Stochastic Gradient Descent

Stochastic Gradient Descent, or SGD for short, is another algorithm to reduce the loss. The word ‘stochastic’ refers to a system or process linked with random probability. Hence, in Stochastic Gradient Descent, a few samples are selected randomly for each iteration instead of the whole dataset.

With the Batch Gradient Descent we saw in the previous post, computing the cost and gradient over the entire training set is often very slow in practice, and can even be intractable on a single machine if the dataset is too big to fit in main memory. Another issue with batch optimization methods is that they give no easy way to incorporate new data in an ‘online’ setting. Stochastic Gradient Descent (SGD) addresses both of these issues by following the negative gradient of the objective after seeing only a single training example, or a few of them. In the neural network setting, SGD is motivated by the high cost of running backpropagation over the full training set; SGD avoids this cost and can still lead to fast convergence.
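
To make the idea concrete, here is a minimal sketch of plain SGD on the same toy squared-loss linear model (illustrative code, not from the series' repo): each parameter update uses a single randomly chosen example instead of the full dataset.

```python
import numpy as np

# Plain SGD on the toy squared-loss linear model y_pred = w * x + b:
# one parameter update per randomly chosen training example.
def sgd(x, y, alpha=0.02, epochs=300, seed=0):
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(x)):   # visit examples in random order
            error = (w * x[i] + b) - y[i]
            w -= alpha * 2 * error * x[i]   # step against the gradient
            b -= alpha * 2 * error          # ...after seeing just this example
    return w, b

# Toy data generated from y = 3x + 1; SGD should recover w ≈ 3, b ≈ 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3 * x + 1
print(sgd(x, y))
```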

There is also a concept called momentum in SGD that allows a neural network to converge faster than plain SGD. It is just an update to the SGD equation with a velocity vector: momentum takes past gradients into account to smooth out each update.

You can read more about it here. Knowing the whole equation is not strictly necessary right now, but there is no harm in reading up on it.
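
For reference, one common form of the momentum update looks roughly like this (a sketch with illustrative names and values; the linked article derives the equation properly):

```python
# One common form of the momentum update: a velocity term accumulates
# past gradients so that consecutive steps point in a smoother direction.
# beta is the momentum coefficient (0.9 is a typical illustrative value).
def momentum_step(w, velocity, grad, alpha=0.01, beta=0.9):
    velocity = beta * velocity + grad  # blend past direction with the new gradient
    w = w - alpha * velocity           # step along the smoothed direction
    return w, velocity
```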

The images below compare the approaches we have discussed:

Projection of the error surface (Source)



SGD without and with Momentum (Source)


Comparison of SGD and Batch Gradient Descent (Source)

Mini Batch Gradient Descent

So far we have seen Batch Gradient Descent, which gives a smooth loss curve and converges directly towards the minimum, and SGD, which is used when the dataset is large and faster convergence on such datasets is required.
But there is a problem with SGD: it updates with only one example at a time, which slows down the computation because we lose the efficiency of vectorized operations. Hence, to overcome this issue, we use a mixture of Batch Gradient Descent and SGD known as Mini-Batch Gradient Descent.

So, Mini-Batch Gradient Descent splits the training data into small batches, which are then used to calculate the error of the model and update the parameters. This gives more robust convergence, the noise from batching can help avoid getting stuck in poor local minima, and it is more efficient than plain SGD.
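
Sticking with the same toy model, a sketch of Mini-Batch Gradient Descent might look like this (illustrative code; the batch size and data are made up for the example):

```python
import numpy as np

# Mini-Batch Gradient Descent on the toy squared-loss linear model:
# shuffle the data each epoch, split it into small batches, and make
# one parameter update per batch using the averaged batch gradient.
def minibatch_gd(x, y, alpha=0.05, epochs=200, batch_size=2, seed=0):
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        order = rng.permutation(n)                    # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]     # one mini-batch
            error = (w * x[idx] + b) - y[idx]
            w -= alpha * np.mean(2 * error * x[idx])  # averaged batch gradient
            b -= alpha * np.mean(2 * error)
    return w, b

# Same toy data as before (y = 3x + 1):
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3 * x + 1
print(minibatch_gd(x, y))
```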

The image below shows a comparison of Mini-Batch Gradient Descent with SGD:

An illustration of convergence with SGD and Mini-Batch Gradient Descent (Source)

Conclusion

The image below gives a general overview of the descent algorithms we have seen so far:
Comparison of all 3 Gradient Descents (Source)

So, we now have a general overview of some of the basic optimization methods, but there are more complex optimization methods, such as the Adam optimizer, that are used in combination with the ones above. In the next part of this series we will look into some more aspects and move on to other topics.

As usual, we will end today's blog post with an inspirational quote:

The Pessimist complains about the wind;
The Optimist expects it to change;
The Realist adjusts the sails.
- William A. Ward


*************************************************************