Mastering Activation Functions in Neural Networks: A Comprehensive Guide
Introduction:
Neural networks are a type of machine learning algorithm that are inspired by the structure and function of the human brain. They are a series of interconnected nodes, or neurons, that process and transmit information. Neural networks are used for a variety of tasks such as image and speech recognition, natural language processing, and predictive modelling.
The basic structure of a neural network consists of:
Input layer: Raw data enters the network through this first layer, which supplies the network with information from the outside world. The nodes in this layer simply pass the data on to the hidden layer; no computation is performed here.
One or more hidden layers: These layers sit between the input and output layers and form the abstraction component of the network. The hidden layer(s) perform all computations on the features received from the input layer and pass the results on to the output layer.
Output layer: This layer presents the results of the network's computations to the outside world (a minimal code sketch of this structure appears below).



(Figure: the biological inspiration that scientists drew on while developing neural networks.)
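To make this three-layer structure concrete, here is a minimal sketch of how such a network might be declared in Keras. The layer sizes (4 inputs, 8 hidden units, 1 output) are arbitrary choices for illustration, not part of the original description.
from tensorflow.keras import Sequential, Input
from tensorflow.keras.layers import Dense

model = Sequential([
    Input(shape=(4,)),               # input layer: passes 4 features into the network
    Dense(8, activation="relu"),     # hidden layer: performs the computation
    Dense(1, activation="sigmoid"),  # output layer: presents the result
])
model.summary()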
During training, the weights and biases of the neurons are adjusted by an optimization method in order to reduce the error between the expected output and the actual output.
Y = Σ(weights × inputs) + bias
This Y is the output and can range from -infinity to +infinity, so it is necessary to bound it to obtain a useful prediction or generalized result. To keep it within a specific range, we use activation functions.
Y = Activation function(Σ(weights × inputs) + bias)
This bounded Y produces better, more usable results in practice.
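As a rough sketch of this idea (the inputs, weights, and bias below are made-up values for illustration), the unbounded weighted sum and its bounded, activated version might look like this:
import tensorflow as tf

# Made-up inputs, weights, and bias for a single neuron (illustrative values)
x = tf.constant([2.0, -1.0, 3.0])
w = tf.constant([0.5, 1.5, -2.0])
b = 0.5

# Raw weighted sum: can take any value from -infinity to +infinity
y_raw = tf.reduce_sum(w * x) + b
print(y_raw)       # -6.0

# Passing it through an activation function bounds the output
y_bounded = tf.sigmoid(y_raw)
print(y_bounded)   # ~0.0025, squashed into the range (0, 1)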
What is an activation function?
An activation function is a mathematical function used in artificial neural networks that determines the output of a neuron based on the input it receives.
Activation functions in layman's terms:
Think of it like this: When you see something, your brain processes the information and decides what action to take. An activation function is like the decision-making process in your brain. It takes the input (the things you see) and produces an output (the action you take).
Why do we need this?
The purpose of an activation function is to introduce non-linearity into the output of a neuron, which allows the neural network to learn complex patterns and make better predictions. Without an activation function, the output of the neuron would be a simple linear function of the input, which would severely limit the learning capabilities of the neural network.
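One quick way to see why (an illustrative check using arbitrary random weights): stacking two layers with no activation in between collapses into a single linear transformation, so the extra layer adds no expressive power.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                                 # arbitrary input vector
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

# Two stacked "layers" with no activation in between...
two_layers = W2 @ (W1 @ x + b1) + b2

# ...are exactly equivalent to one linear layer with merged weights and bias
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)

print(np.allclose(two_layers, one_layer))              # True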
Properties of Activation function:
Non-Linearity
Continuity
Differentiability
Range
Monotonicity
Approximates the identity function near the origin
Computational efficiency
Types of Activation function:
Activation functions can be broadly classified into two basic categories:
Binary Step Function: A mathematical function that takes a scalar input and returns a binary output of either 0 or 1.
Linear Activation Function: The linear activation function is used in regression problems, where the output is a continuous value rather than a binary one (a small sketch of both appears below).
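As a small sketch of these two categories (the function definitions below are illustrative implementations, not a library API):
import tensorflow as tf

def binary_step(x):
    # 1.0 where the input is >= 0, otherwise 0.0
    return tf.cast(x >= 0, tf.float32)

def linear(x):
    # identity mapping: the output equals the input
    return x

a = tf.constant([-2.0, -1.0, 0.0, 1.0, 2.0])
print(binary_step(a))  # [0. 0. 1. 1. 1.]
print(linear(a))       # [-2. -1.  0.  1.  2.]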
Some commonly used Nonlinear Activation functions:
There are several types of nonlinear activation functions commonly used in neural networks, including the sigmoid function, the rectified linear unit (ReLU) function, the hyperbolic tangent function, and others. Each type of activation function has its own advantages and disadvantages, and the choice of activation function depends on the specific problem being solved and the characteristics of the data being used.
Sigmoid function:
The sigmoid function is a type of activation function commonly used in artificial neural networks. It is a mathematical function that takes any input value and produces an output value between 0 and 1, which can be interpreted as a probability. The sigmoid function has an "S"-shaped curve and is defined by the formula:
f(x) = 1 / (1 + e^(-x))
When plotted, the sigmoid function shows an S-shaped curve bounded between 0 and 1.
Code snippet:
import tensorflow as tf
from tensorflow.keras.activations import sigmoid

# Apply the sigmoid activation to a few sample values
a = tf.constant([-2, -1, 0, 1, 2], dtype=tf.float32)
print(sigmoid(a))  # values squashed into the range (0, 1)
Limitations:
The gradients of the sigmoid function become very small for large values of |x|, which can slow down learning in neural networks. In other words, it gives rise to the problem of "vanishing gradients": the network may stop learning altogether or learn extremely slowly.
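A quick numerical sketch of this effect (illustrative only), using tf.GradientTape to inspect the gradients:
import tensorflow as tf

x = tf.constant([0.0, 5.0, 10.0])
with tf.GradientTape() as tape:
    tape.watch(x)          # constants must be watched explicitly
    y = tf.sigmoid(x)

# Gradients shrink rapidly as x grows: ~0.25, ~0.0066, ~0.000045
print(tape.gradient(y, x))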
Tanh function:
The hyperbolic tangent (tanh) function is another type of activation function commonly used in artificial neural networks. It is a mathematical function that takes any input value and produces an output value between -1 and 1, which can be interpreted as a rescaled and shifted version of the sigmoid function.
The tanh function has an S-shaped curve and is defined by the formula:
f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
The tanh function is similar to the sigmoid function, but it is zero-centered, which can be useful in some neural network architectures. The tanh function also has steeper gradients than the sigmoid function, which can make it more effective at learning from the data.
When plotted, the tanh function shows an S-shaped curve bounded between -1 and 1.
Code snippet:
import tensorflow as tf
from tensorflow.keras.activations import tanh

# Apply the tanh activation to a few sample values
a = tf.constant([-2, -1, 0, 1, 2], dtype=tf.float32)
print(tanh(a))  # values squashed into the range (-1, 1)
Limitations:
It also suffers from the vanishing gradient problem, but its derivatives are steeper than those of the sigmoid, which makes the gradients stronger for tanh than for sigmoid.
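To illustrate the steeper-gradient point (a small sketch, with the origin chosen as a convenient comparison point): the derivative of tanh at 0 is 1.0, four times the sigmoid's 0.25.
import tensorflow as tf

x = tf.constant(0.0)
with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    y_tanh = tf.tanh(x)
    y_sigmoid = tf.sigmoid(x)

print(tape.gradient(y_tanh, x))     # 1.0  -> steeper slope at the origin
print(tape.gradient(y_sigmoid, x))  # 0.25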
ReLU:
The ReLU function has become popular in recent years because it has several advantages over other activation functions. One advantage is that it is computationally efficient, making it well-suited for large neural networks. Another advantage is that it has been shown to improve the performance of deep neural networks, which are neural networks with many layers.
It introduces non-linearity into the network, which allows it to learn more complex features and patterns in the data. In addition, the ReLU function has a derivative that is either 0 or 1, which makes it easy to calculate gradients during backpropagation, a technique used to update the weights of the neural network during training.
The ReLU function is defined by the formula:
f(x) = max(0, x)
When plotted, the ReLU function is zero for all negative inputs and rises linearly for positive inputs.
Code snippet:
import tensorflow as tf
from tensorflow.keras.activations import relu

# Apply ReLU: negative inputs become 0, positive inputs pass through unchanged
a = tf.constant([-2, -1, 0, 1, 2], dtype=tf.float32)
print(relu(a))  # [0. 0. 0. 1. 2.]
Limitations:
It can lead to the "dying ReLU" problem, where some neurons effectively "die" during training and stop learning because their inputs are always negative and their gradients are therefore always zero.
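The effect can be sketched numerically (illustrative only): the ReLU gradient is exactly zero for any negative input, so a neuron whose pre-activations stay negative receives no weight updates.
import tensorflow as tf

x = tf.constant([-3.0, -0.5, 0.5, 3.0])
with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.nn.relu(x)

# The gradient is 0 for negative inputs and 1 for positive inputs
print(tape.gradient(y, x))  # [0. 0. 1. 1.]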
Leaky ReLU:
It is a modified version of the standard ReLU activation function that addresses the problem of "dying" ReLUs, where neurons can become inactive and stop learning.
In the standard ReLU function, if the input is less than zero, the output is zero. However, in the Leaky ReLU function, if the input is less than zero, the output is a small non-zero value, usually a fraction of the input value.
The advantage of using the Leaky ReLU function is that it prevents neurons from becoming completely inactive, even if they have a negative input. This can help to improve the performance of neural networks, especially in deep learning models where there are many layers and a large number of neurons.
The Leaky ReLU function is defined by the formula:
f(x) = x for x ≥ 0
f(x) = αx for x < 0, where α is a small constant (e.g., 0.01)
When plotted, the Leaky ReLU function looks like ReLU, but with a small non-zero slope for negative inputs.
Code snippet:
from tensorflow.keras.layers import LeakyReLU, Dense

# Leaky ReLU with a small slope (alpha = 0.01) for negative inputs;
# the layer instance is callable, so it can be passed as an activation
l = LeakyReLU(alpha=0.01)
layer = Dense(5, activation=l)
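For reference, the same behaviour can be checked with the functional op tf.nn.leaky_relu (a quick sketch mirroring the earlier snippets):
import tensorflow as tf

a = tf.constant([-2.0, -1.0, 0.0, 1.0, 2.0])
# Negative inputs are scaled by alpha instead of being zeroed out
print(tf.nn.leaky_relu(a, alpha=0.01))  # [-0.02 -0.01  0.    1.    2.  ]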
Limitations:
It is not equally well-suited to every type of data or task.
Since Leaky ReLU produces a small non-zero output for negative inputs, it can introduce some noise into the output, which can be problematic in some applications.
Choosing the right value for the leak rate (alpha) can be tricky and may require some trial and error.
PReLU:
PReLU stands for Parametric Rectified Linear Unit, and is an activation function commonly used in deep learning neural networks. It is a variation of the standard ReLU (Rectified Linear Unit) activation function, which addresses some of the limitations of ReLU and Leaky ReLU.
In the standard ReLU function, if the input is less than zero, the output is zero. In the Leaky ReLU function, if the input is less than zero, the output is a small non-zero value. In the PReLU function, if the input is less than zero, the output is a linear function of the input multiplied by a learnable parameter.
The PReLU function is defined by the formula:
f(x) = x for x ≥ 0
f(x) = a·x for x < 0, where a is a learnable parameter
When plotted, the PReLU function looks like Leaky ReLU, except that the slope for negative inputs is a learned parameter rather than a fixed constant.
Code snippet:
from tensorflow.keras.layers import PReLU, Dense

# PReLU: the slope for negative inputs is a learnable parameter;
# the layer instance is callable, so it can be passed as an activation
p = PReLU()
layer = Dense(5, activation=p)
Limitations:
PReLU introduces additional parameters that must be learned during training, which increases the complexity of the neural network and the time required for training.
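To see the extra parameters concretely (an illustrative sketch with arbitrary layer sizes), a PReLU placed after a Dense layer with 5 units adds five learnable slope values of its own:
from tensorflow.keras import Sequential, Input
from tensorflow.keras.layers import Dense, PReLU

model = Sequential([
    Input(shape=(3,)),
    Dense(5),   # 3*5 weights + 5 biases = 20 parameters
    PReLU(),    # one learnable slope per unit = 5 extra parameters
])
print(model.count_params())  # 25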
Conclusion:
In this blog, we covered the basics of neural networks and examined the importance of various non-linear activation functions along with their mathematical expressions, graphs, and code snippets. To conclude, activation functions play a crucial role in neural networks: they determine the output of each neuron and, ultimately, the output of the entire network.