Wanna Make AI Series #3

Hello all, welcome to the 3rd part of the "Wanna Make AI Series". Today we will be learning about the loss function and gradient descent, why we need them, and we will try out some code to visualize them. Before we begin, make sure to check out the first part and the second part of this series, as the info in those posts is needed to build your knowledge step by step.

Also, you can get all the code of this series on our GitHub.

So, let's begin and dive into the topics.


Some basic understanding

Before describing what a loss function is, we have to understand what loss itself is.

So, guys, as we know, training any AI model simply means finding good values for the weights and the bias that minimize the loss, and this process is generally known as Empirical Risk Minimization.

Loss is the penalty the model pays for a bad prediction. It is a number that indicates how bad the model's prediction was on a single example. If the prediction is perfect, the loss is zero; otherwise it is greater than zero.

Hence, the goal of training any model is to find values of the weights and biases that have low loss, on average, across all examples. In the figure below, the arrows represent the loss and the blue lines represent the model's predictions: the left model has high loss and the right model has low loss.

fitting_the_linear_regression_model
High loss in left model and low loss in right model


You may wonder: is there a mathematical function, which we could call a loss function, that aggregates all the individual losses in a meaningful way?

So, here comes the concept of Squared Loss (or L2 Loss).


Squared Loss (or L2 Loss) 

So, squared loss is a loss function that, for a single example, can be defined as follows:
= the square of the difference between the label and the prediction
= (observation - prediction(x))^2
= (y - y_dash)^2
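
To make this concrete, here is a minimal Python sketch of the squared loss for a single example (the numbers are made-up values just for illustration):

    # Squared loss (L2 loss) for a single labelled example.
    def squared_loss(y, y_dash):
        # y is the label (observation), y_dash is the model's prediction.
        return (y - y_dash) ** 2

    print(squared_loss(3.0, 3.0))  # 0.0  -> perfect prediction, zero loss
    print(squared_loss(3.0, 2.5))  # 0.25 -> small error, small penalty
    print(squared_loss(3.0, 5.0))  # 4.0  -> bigger error, much bigger penalty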

But we also need some metric to determine the loss over the whole dataset.
Hence, we define Mean Squared Error (MSE), which is the average squared loss per example over the whole dataset. To find the MSE, we sum up the squared losses of the individual examples and then divide by the number of examples.

MSE = (1/N) * sum over all (x, y) in D of (y - prediction(x))^2


where: 

(x, y) is an example, in which x is the set of features that the model uses to make predictions, and y is the example's label (for example, temperature).

prediction(x) is a function of the weights and bias in combination with the set of features x.

D is a data set containing many labeled examples, which are (x,y) pairs.

N is the number of examples in D.

Although MSE is commonly used in machine learning, it is neither the only practical loss function nor the best loss function for all circumstances.
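
As a quick illustration, here is one way to compute the MSE over a small dataset in Python with NumPy (the labels and predictions are made-up numbers, not from any real model):

    import numpy as np

    # Labels y and predictions prediction(x) for N = 4 examples (made-up values).
    y = np.array([1.0, 2.0, 3.0, 4.0])
    y_dash = np.array([1.1, 1.8, 3.5, 3.9])

    # MSE = (1/N) * sum of the squared losses over all examples in D.
    mse = np.mean((y - y_dash) ** 2)
    print(mse)  # 0.0775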


Reducing the Loss

So far we have seen how to determine the loss that a model is incurring. Now we want our model to have the least possible loss. There are various ways to achieve this, such as:

  1. Iterative Approach
  2. Gradient Descent or Batch Gradient Descent 
  3. Stochastic Gradient Descent
  4. Mini Batch Gradient Descent

NOTE: In Gradient Descent (also called Batch Gradient Descent) we use the whole training data for each update, i.e. one update per epoch, whereas in Stochastic Gradient Descent we use only a single training example per update. Mini-Batch Gradient Descent lies between these two extremes: we use a mini-batch (a small portion) of the training data per update, and the rule of thumb is to pick a mini-batch size that is a power of 2, like 32, 64, 128, etc.
For more details: cs231n lecture notes
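
As a rough sketch (not the exact code from our GitHub), this is how the three variants differ in how much data they look at before each weight update; the dummy data and the batch size of 32 are just assumptions for illustration:

    import numpy as np

    X = np.random.randn(1000, 3)   # 1000 examples, 3 features (dummy data)
    batch_size = 32                # power of 2, as per the rule of thumb

    # Batch GD: one update per epoch, using all 1000 examples at once.
    # SGD: one update per single example, i.e. 1000 updates per epoch.
    # Mini-batch GD: one update per shuffled mini-batch of 32 examples.
    indices = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        mini_batch = X[indices[start:start + batch_size]]
        # ...compute the gradient on mini_batch and update the weights here...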

So, guys, let's go through them one by one in some detail.

Iterative Approach:
In this approach the machine learning model is trained by starting with an initial guess for the weights and bias and iteratively adjusting those guesses until we get the lowest possible loss. In linear regression the weights can be initialized randomly, but most often we just use trivial values, because the starting values in linear regression are not important.
A basic diagram for this approach can be:
                                                                                   
diagram_for_iterative_approach
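
A minimal sketch of this loop for a one-weight linear model might look like the code below. The "adjust the guess" step here is just a naive random nudge to show the loop structure (the toy data and step counts are assumptions); gradient descent, discussed next, is the smarter way to compute that adjustment:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = 2.0 * x + 1.0                 # toy data: true weight 2, true bias 1

    def mse(w, b):
        return np.mean((y - (w * x + b)) ** 2)

    # Start from trivial guesses and keep any small random nudge that lowers the loss.
    w, b = 0.0, 0.0
    for step in range(5000):
        w_new, b_new = w + rng.normal(0, 0.1), b + rng.normal(0, 0.1)
        if mse(w_new, b_new) < mse(w, b):   # keep the new guess only if loss went down
            w, b = w_new, b_new

    print(w, b, mse(w, b))   # w and b end up close to 2 and 1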




Gradient Descent:
Gradient descent is another optimization algorithm, typically discussed for convex functions, that tweaks the parameters iteratively to minimize a given function down to a (local) minimum. At each step it moves in the direction opposite to the gradient of the function at the current point, because this is the direction of steepest descent.

The picture below shows the steps taken to find weights that lower the loss.


gradient_descent


The animation below gives a good overview of how the MSE is reduced by finding proper values for the weights and biases. Note that in the right-hand curve the parameter "a" is the weight.

Source: GitHub
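
To tie it together, here is a minimal gradient descent sketch for the same kind of one-weight linear model, minimizing the MSE; the toy data, learning rate, and step count are assumptions for illustration, not the exact code from our GitHub:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = 2.0 * x + 1.0                 # toy data: true weight 2, true bias 1

    w, b = 0.0, 0.0
    learning_rate = 0.05
    for step in range(2000):
        y_dash = w * x + b
        # Gradients of MSE = mean((y - y_dash)^2) with respect to w and b.
        grad_w = -2.0 * np.mean((y - y_dash) * x)
        grad_b = -2.0 * np.mean(y - y_dash)
        # Step in the direction opposite to the gradient (steepest descent).
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

    print(w, b)   # should end up close to 2.0 and 1.0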



***********************************************

So, guys, this was it for today, as some people requested that the blogs be kept short so that they remain interesting to read. And since these topics are not small, keeping those requests in mind, we will continue this discussion in the next post of this series.

Hope to see you there.
And as always signing off from today's post with an inspirational quote:

The Pessimist Sees Difficulty In Every Opportunity. The Optimist Sees Opportunity In Every Difficulty. 

– Winston Churchill



***********************************************