Wanna Make AI Series #2


Hello again! We hope that you are all doing well and that you learnt something from the first part of our Wanna Make AI series. If you didn't see that post then you can find it here, as it is important for you to understand that one first before going further in this post.

So, today we will be covering:

  1. Different types of Activation Functions 
  2. Which Activation Function to choose
  3. Coding the Activation Functions

which will expand our knowledge base from the previous post in this series.

You can get the whole code for today's post on our Github.

So, let's dive into it :)

Today's Theory #2

Different types of Activation Functions

As we saw in our earlier post (here), we can make our own model from scratch with just linear neurons. Linear neurons are easy to compute, but they have their own limitations. The neural network we talked about earlier consisted of just linear neurons and had no hidden layers; it is linear because it just uses that simple equation which we discussed. But in practice, linear neurons are not that great at learning important features, and hidden layers are required for that, as they can learn important features from the input data. Hence we need some sort of non-linear function which can be implemented by the neurons.

Hence, the concept of Activation Functions comes into play. But don't get baffled by this term, as you have already implemented one type of activation function in the 1st part of this series. Astonished? Let's first see what an Activation Function is.

So, Activation Functions, or Transfer Functions, are the functions that we generally apply at the end of the neural net to get the output.
These are of 2 types:
  1. Linear Activation Functions
  2. Non-Linear Activation Functions
Earlier we explained how a Perceptron gives an output of 1 or 0 depending upon the input. So it can be seen as a linear activation function; more specifically, the function we discussed can be called a Binary Step function.

Now coming to the different types of Non-Linear Activation Functions, here is the list:
  • Sigmoid
  • Tanh
  • ReLU (Rectified Linear Unit)
  • Leaky-ReLU 
  • Maxout
  • Parameterised-ReLU
  • Exponential LU 
  • Softmax
  • Swish    ...etc
If you want to see them in code then don't worry, we will code these functions after their explanations.

[ NOTE: The constant "e" which we will be using below is a mathematical constant and is approximately equal to 2.71828 ]

So, below is a brief description about some of the above functions:


Binary Step Function
It's just a function where the output depends upon a threshold value. So, if the value input to the function is greater than that threshold, then the function gets activated, i.e. it outputs 1; otherwise it does not get activated, i.e. it outputs 0. So, it is of the form:

    F(n) = 1, if n > threshold
         = 0, otherwise



Linear Function
Linear functions are nothing but the straight-line formula that we study in our elementary mathematics.
So, it is of the form: 

    F(t) = m*t + c 

where m is some constant greater than 0 and c is also a constant that is greater than or equal to 0.


Sigmoid Function
The sigmoid function is also called the Logistic Activation Function. The output of this function ranges from 0 to 1. This type of function is better when we are dealing with probabilistic outputs, as the probability of any outcome can only be between 0 and 1 (inclusive).
So, it is of the form:

    F(n) = 1 / ( 1 + e^(-n) )


Tanh Function
The tanh function is the same tanh from our mathematics. It is similar to sigmoid but has a range from -1 to 1 and is generally used in classification problems between 2 classes.
It is of the form:

    F(n) = [2 / ( 1 + e^(-2n) ) ] - 1

or, more simply, we can write it as:

    F(n) = ( 2 * sigmoid(2*n) ) - 1 
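
As a quick check (a small sketch, not the post's code), we can verify numerically with NumPy that the two forms agree:

    import numpy as np

    def sigmoid(n):
        return 1 / (1 + np.exp(-n))

    n = np.linspace(-5, 5, 11)
    # tanh written via the sigmoid identity: tanh(n) = 2*sigmoid(2n) - 1
    tanh_via_sigmoid = 2 * sigmoid(2 * n) - 1
    print(np.allclose(np.tanh(n), tanh_via_sigmoid))  # True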


ReLU Function
ReLU stands for Rectified Linear Unit, and it is the most common function used in Deep Learning. Its output for a negative input is straight away 0, and this is the reason why it doesn't activate all the neurons at the same time: the output is non-zero only when the input is positive, which means a neuron only gets activated (gives a non-zero output) when we have a positive input to this function.
It is of the form:
    
    F(n) = n, if n >= 0
         = 0, otherwise

or, more simply, we can write it as:

    F(n) = max( n , 0 )

But there is an issue here: when we calculate the gradient (derivative) of this function, which is an essential step in the back-propagation process, the gradient is 0 for every negative input. For such neurons the weights and the bias values don't get updated in the iterations, which decreases the ability of the model to fit or get trained on the data properly, because this can create some dead neurons that never get activated. This is known as the dying ReLU problem.

Hence, to overcome this issue, we come up with another activation function called Leaky-ReLU.


[NOTE: We need to calculate the derivative in all our Deep Learning or Machine Learning implementation because we want to know as to which direction should we move to reduce the error of our model and how much should we update the weights and biases depending upon the slope]
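
To make this concrete, here is a small sketch (not from the original post) of ReLU and its derivative: for negative inputs the derivative is exactly 0, so such a neuron receives no gradient and its weights stop updating:

    import numpy as np

    def relu(n):
        return np.maximum(n, 0)

    def relu_derivative(n):
        # 1 for positive inputs, 0 for negative inputs
        return (n > 0).astype(float)

    n = np.array([-3.0, -0.5, 0.5, 3.0])
    print(relu(n))             # [0.  0.  0.5 3. ]
    print(relu_derivative(n))  # [0. 0. 1. 1.]  -> no gradient flows for the negative inputs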


Leaky-ReLU
So, Leaky-ReLU is just an enhanced version of ReLU which overcomes the dying ReLU problem.

Here, instead of outputting 0 for negative values, we output a very small fraction of the input for negative inputs.

It can be shown as:

    F(n) = n,        if n >= 0
         = 0.01*n,   otherwise


Maxout function
It is a generalization of the ReLU and Leaky-ReLU functions. It is a piecewise linear function that returns the maximum of its inputs.
It is generally used together with Dropout of neurons.

Dropout is just the (random) dropping of neurons in a neural network, basically to avoid overfitting on the data which we are training our model on.

Maxout function can be written as:

    F(x) = max( w1^T * x + b1 ,  w2^T * x + b2 )

where ^T denotes the transpose

Both ReLU and Leaky-ReLU are special cases of this form: for example, for ReLU we have w1 = 0 and b1 = 0. So maxout enjoys the benefits of ReLU (a linear operation with no saturation) and doesn't have the dying ReLU problem.
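
A minimal NumPy sketch of the maxout idea (the input, weights and biases below are made-up illustration values, not from the original post):

    import numpy as np

    def maxout(x, w1, b1, w2, b2):
        # maximum of two linear (affine) pieces
        return np.maximum(w1.T @ x + b1, w2.T @ x + b2)

    x = np.array([1.0, 2.0])
    w1 = np.zeros(2)                  # with w1 = 0 and b1 = 0 ...
    b1 = 0.0
    w2 = np.array([0.5, -1.0])
    b2 = 0.3
    print(maxout(x, w1, b1, w2, b2))  # ... this reduces to ReLU applied to w2^T*x + b2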



Parameterised-ReLU
It's just a variant of Leaky-ReLU where, instead of multiplying the negative input by 0.01, we multiply it by a parameter "a". Here, "a" is a trainable parameter, and this function is used when the dying ReLU problem still can't be solved by Leaky-ReLU.

It can be shown as:
    F(n) = n,     if n >= 0
         = a*n,   otherwise
    


Exponential Linear Unit or ELU
Now, again, this is a modification of Parameterised ReLU where, instead of multiplying the negative input by just "a", we multiply "a" by an exponential curve of the input.

It can be shown as:

    F(n) = n,                  if n >= 0
         = a * ( e^n - 1 ),    otherwise


Softmax
This is the activation function that we used in our previous post of this series. It basically overcomes the limitation of the sigmoid function, which can only classify between 2 classes.
Softmax is used for Multiclass Classification problems.

It can be written, for each element x_i of the input vector x, as:
    F(x_i) = e^(x_i) / ( sum_j e^(x_j) )
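
For example (a small sketch, not the post's code), softmax turns a vector of raw scores into probabilities that sum to 1:

    import numpy as np

    def softmax(x):
        e = np.exp(x - np.max(x))   # subtracting the max is a common numerical-stability trick
        return e / e.sum()

    scores = np.array([1.0, 2.0, 3.0])
    print(softmax(scores))        # approx [0.09  0.245 0.665]
    print(softmax(scores).sum())  # 1.0 (up to floating point)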



Swish
This is an activation function proposed by Google researchers, and it shows better performance than ReLU on deeper neural networks. Its output is unbounded above but bounded below (it dips only slightly below zero for negative inputs). But it is a lesser-known activation function.

It can be written as:

    F(n) = [ n / ( 1 + e^(-n) ) ] 

or, more simply, we can write it as:

    F(n) = ( n * sigmoid(n) ) 

   
[NOTE: There are more types of activation functions, but we have shown the more commonly used ones]



Which Activation Function to choose?

The answer to this question varies from problem to problem, but there are some common rules of thumb to follow:

  • You can use ReLU with most ML/DL models, as it is a good general-purpose activation function, especially for models having hidden layers. If you run into the dying ReLU problem, use Leaky-ReLU instead.
  • Use the Sigmoid function for 2-class classification problems, but be careful with sigmoid (and tanh) in deep networks, as they suffer from the Vanishing Gradient Problem.
  • Use the Softmax function for multi-class classification problems.

At last, in general you can start with ReLU and then switch to other functions if the accuracy of the model is not good enough.




Coding our Activation Functions

So, coming to the most exciting part, the coding :)
Just fire up your favorite IDE or Google Colab and start coding.
You can use Python version 3.5 or above.

So, first we import the necessary libraries and then define a class with its constructor and all the methods (the activation function definitions):

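The exact code is available on our Github (linked above); the sketch below is a minimal reconstruction of the idea, and the constructor details and method signatures here are assumptions:

    import numpy as np
    import matplotlib.pyplot as plt

    class ActivationFunction:
        def __init__(self, low=-10, high=10, points=1000):
            # range of input values the functions will be evaluated on
            self.n = np.linspace(low, high, points)

        def binary_step(self, n, threshold=0):
            # outputs 1 when the input crosses the threshold, 0 otherwise
            return np.where(n > threshold, 1, 0)

        def linear(self, n, m=1, c=0):
            # straight-line function m*n + c
            return m * n + c

        def sigmoid(self, n):
            # squashes any input into the range (0, 1)
            return 1 / (1 + np.exp(-n))

        def tanh(self, n):
            # same as 2*sigmoid(2n) - 1, range (-1, 1)
            return np.tanh(n)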

Now, in the class, we just add more methods:

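Again, only a sketch of what those methods might look like; they are shown here as standalone functions (with assumed parameter defaults) so the snippet runs on its own, but inside the class each one becomes a method taking self:

    import numpy as np

    def relu(n):
        return np.maximum(n, 0)

    def leaky_relu(n):
        return np.where(n >= 0, n, 0.01 * n)

    def parameterised_relu(n, a=0.1):
        # "a" would normally be learned during training
        return np.where(n >= 0, n, a * n)

    def elu(n, a=1.0):
        return np.where(n >= 0, n, a * (np.exp(n) - 1))

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    def swish(n):
        return n / (1 + np.exp(-n))   # equivalently n * sigmoid(n)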


The below "plot_all" Method is for plotting all the graphs:

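In the post, plot_all is a method of the class; the sketch below shows the same idea as a standalone helper (the subplot layout and figure size are assumptions):

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_all(functions, low=-10, high=10):
        # plot each named activation function on its own subplot
        n = np.linspace(low, high, 1000)
        fig, axes = plt.subplots(nrows=len(functions), ncols=1,
                                 figsize=(6, 3 * len(functions)))
        for ax, (name, fn) in zip(axes, functions.items()):
            ax.plot(n, fn(n))
            ax.set_title(name)
        fig.tight_layout()
        plt.show()

    # example usage with a couple of the functions defined above
    plot_all({"sigmoid": lambda n: 1 / (1 + np.exp(-n)),
              "relu":    lambda n: np.maximum(n, 0)})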

Now, we initialize an object of our "ActivationFunction" class and then define a loop that calls the methods based on the user's input.
As only integer values are supported, we wrap our input in a try/except block:

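A minimal sketch of such a driver loop, assuming the ActivationFunction sketch from above; the menu numbers and prompts here are made up for illustration, not the post's exact ones:

    import matplotlib.pyplot as plt

    act = ActivationFunction()

    menu = {
        1: ("Sigmoid", act.sigmoid),
        2: ("Tanh", act.tanh),
        3: ("Binary step", act.binary_step),
        0: ("Quit", None),
    }

    while True:
        print("\n".join(f"{k}: {name}" for k, (name, _) in menu.items()))
        try:
            choice = int(input("Choose a function to plot (0 to quit): "))
        except ValueError:
            print("Only integer values are supported, please try again.")
            continue

        if choice == 0:
            break
        if choice not in menu:
            print("Unknown option, please try again.")
            continue

        name, fn = menu[choice]
        plt.plot(act.n, fn(act.n))
        plt.title(name)
        plt.show()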



Here we are just plotting all the functions with the help of the above code. You can play around with this code by downloading it from our Github or copy-pasting it into your Python-enabled IDE.

[Figure: Subplots of all the Activation Functions]



So guys, this was all about activation functions.

Please let us know your doubts in the comments and we will get back to you soon.

So, in the next part of this series we will discuss Gradient Descent, and till then keep practicing more and more.

*************************
We will end today's post with a quote:

For the things we have to learn before we can do them, we learn by doing them.

-Aristotle

***************************************************