When building a Deep Learning model, activation functions are an important choice to make. In this article, we'll review the main activation functions, their implementations in Python, and the advantages and disadvantages of each.

# Linear Activation

Linear activation is the simplest form of activation: $f(x)$ is just the identity, $f(x) = x$. If every layer uses a linear activation, your whole Neural Network collapses into a plain linear regression.

Linear activations are only needed as the last layer of a regression problem. The whole idea behind the other activation functions is to create non-linearity, to be able to model highly non-linear data that cannot be fitted by a simple regression!
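To see why, here is a minimal sketch (the weight matrices `W1` and `W2` are illustrative, not from a real model): stacking two layers with identity activations is equivalent to a single linear map, so the extra depth adds nothing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with identity (linear) activations
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))
x = rng.standard_normal(3)

h = W1 @ x   # first layer, identity activation
y = W2 @ h   # second layer, identity activation

# A single combined layer gives the same output
W = W2 @ W1
assert np.allclose(y, W @ x)
```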

# ReLU

ReLU stands for Rectified Linear Unit. It is a widely used activation function. The formula is simply the maximum of $x$ and 0:

$$f(x) = \max(x, 0)$$

To implement this in Python, you might simply use:

```python
def relu(x):
    return max(x, 0)
```

The derivative of the ReLU is:

- $1$ if $x$ is greater than 0
- $0$ if $x$ is smaller than or equal to 0

```python
def der_relu(x):
    if x <= 0:
        return 0
    else:
        return 1
```

Let’s simulate some data and plot it to illustrate this activation function:

```python
import numpy as np
import matplotlib.pyplot as plt

# Data which will go through the activations
x = np.linspace(-10, 10, 100)

plt.figure(figsize=(12, 8))
plt.plot(x, [relu(v) for v in x], label="relu")
plt.plot(x, [der_relu(v) for v in x], label="derivative")
plt.title("ReLU")
plt.legend()
plt.show()
```

**Advantage:** Easy to implement and quick to compute.

**Disadvantage:** Problematic when we have lots of negative values, since the output is then always 0, which leads to the death of the neuron (the "dying ReLU" problem).
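In practice the activation is applied to whole arrays at once rather than scalar by scalar; a vectorized sketch with NumPy (the function names are mine, not a standard API):

```python
import numpy as np

def relu_vec(x):
    # Element-wise max(x, 0) over a whole array
    return np.maximum(x, 0)

def der_relu_vec(x):
    # 1 where x > 0, else 0 (the derivative at exactly 0 is set to 0 here)
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu_vec(x).tolist())      # [0.0, 0.0, 0.0, 0.5, 2.0]
print(der_relu_vec(x).tolist())  # [0.0, 0.0, 0.0, 1.0, 1.0]
```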

# Leaky-ReLU

Leaky-ReLU is an improvement over the main shortcoming of the ReLU, in the sense that it handles negative values pretty well while still bringing non-linearity:

$$f(x) = \max(0.01x, x)$$

The derivative is also simple to compute:

- $1$ if $x > 0$
- $0.01$ otherwise

```python
def leaky_relu(x):
    return max(0.01 * x, x)

def der_leaky_relu(x):
    if x < 0:
        return 0.01
    else:
        return 1
```

And we can plot the result of this activation function:

```python
plt.figure(figsize=(12, 8))
plt.plot(x, [leaky_relu(v) for v in x], label="leaky-relu")
plt.plot(x, [der_leaky_relu(v) for v in x], label="derivative")
plt.title("Leaky-ReLU")
plt.legend()
plt.show()
```

**Advantage:** Leaky-ReLU overcomes the death of the neuron linked to a zero slope on the negative side.

**Disadvantage:** The factor 0.01 is arbitrary, and can be turned into a tunable parameter (PReLU, for Parametric ReLU).
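A minimal sketch of that parametric variant (the `alpha` parameter name is mine; in a real PReLU it is learned during training, and is assumed here to satisfy $0 < \alpha < 1$):

```python
def prelu(x, alpha=0.01):
    # Leaky-ReLU with a tunable negative slope instead of a fixed 0.01
    return x if x > 0 else alpha * x

def der_prelu(x, alpha=0.01):
    return 1.0 if x > 0 else alpha

print(prelu(-10, alpha=0.1))      # -1.0
print(der_prelu(-10, alpha=0.1))  # 0.1
```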

# ELU

Exponential Linear Units (ELU) try to make the mean activations closer to zero, which speeds up learning.

$$f(x) = \begin{cases} x & \text{if } x > 0 \\ a(e^x - 1) & \text{otherwise} \end{cases}$$

where $a$ is a positive constant.

The derivative is:

- $1$ if $x > 0$
- $a e^x$ otherwise

If we set $a$ to 1:

```python
def elu(x):
    if x > 0:
        return x
    else:
        return np.exp(x) - 1

def der_elu(x):
    if x > 0:
        return 1
    else:
        return np.exp(x)

plt.figure(figsize=(12, 8))
plt.plot(x, [elu(v) for v in x], label="elu")
plt.plot(x, [der_elu(v) for v in x], label="derivative")
plt.title("ELU")
plt.legend()
plt.show()
```

**Advantage:** Can achieve higher accuracy than ReLU.

**Disadvantage:** Same issue as ReLU for strongly negative inputs, and $a$ needs to be tuned.
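The implementation above hard-codes $a = 1$; a sketch with the scale exposed as a parameter (the name `a` follows the formula above):

```python
import numpy as np

def elu_a(x, a=1.0):
    # ELU with an explicit positive scale a on the negative branch
    return x if x > 0 else a * (np.exp(x) - 1)

def der_elu_a(x, a=1.0):
    return 1.0 if x > 0 else a * np.exp(x)

print(elu_a(2.0))            # 2.0: positive inputs pass through
print(elu_a(-100.0, a=2.0))  # close to -2.0, the saturation level -a
```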

# Softplus

The Softplus function is a continuous approximation of ReLU. It is given by:

$$f(x) = \ln(1 + e^x)$$

The derivative of the softplus function is:

$$f'(x) = \frac{e^x}{1 + e^x} = \frac{1}{1 + e^{-x}}$$

You can implement them in Python:

```python
def softplus(x):
    return np.log(1 + np.exp(x))

def der_softplus(x):
    return np.exp(x) / (1 + np.exp(x))

plt.figure(figsize=(12, 8))
plt.plot(x, [softplus(v) for v in x], label="softplus")
plt.plot(x, [der_softplus(v) for v in x], label="derivative")
plt.title("Softplus")
plt.legend()
plt.show()
```

**Advantage:** Softplus is continuous and differentiable everywhere, which can be a useful property. It is interesting to use when the values are between 0 and 1.

**Disadvantage:** As with ReLU, problematic when we have lots of negative values, since the output gets really close to 0 and might lead to the death of the neuron.
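A side note worth checking numerically: the derivative of softplus, $e^x/(1+e^x)$, is exactly the sigmoid function covered below (both functions are redefined here so the check is self-contained):

```python
import numpy as np

def der_softplus(x):
    # Derivative of log(1 + exp(x))
    return np.exp(x) / (1 + np.exp(x))

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-5, 5, 11)
# The two expressions agree up to floating-point error
assert np.allclose(der_softplus(x), sigmoid(x))
```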

# Sigmoid

Sigmoid is one of the most common activation functions in the literature these days. The sigmoid function has the following form:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The derivative of the sigmoid is:

$$\sigma'(x) = \sigma(x)(1 - \sigma(x))$$

In Python:

```python
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def der_sigmoid(x):
    return sigmoid(x) * (1 - sigmoid(x))

plt.figure(figsize=(12, 8))
plt.plot(x, [sigmoid(v) for v in x], label="sigmoid")
plt.plot(x, [der_sigmoid(v) for v in x], label="derivative")
plt.title("Sigmoid")
plt.legend()
plt.show()
```

It might not be obvious when considering data between $-10$ and $10$ only, but the sigmoid is subject to the vanishing gradient problem. This means that the gradient tends to 0 as $|x|$ grows large.

Since the gradient is the sigmoid times one minus the sigmoid, it can be computed efficiently: if we keep track of the sigmoid values from the forward pass, the gradient comes almost for free.
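A minimal sketch of that trick: cache the forward-pass value `s` and reuse it for the gradient, instead of evaluating the exponential again.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

s = sigmoid(0.5)     # computed once during the forward pass
grad = s * (1 - s)   # gradient reuses s: no extra exp() call

# The cached value matches recomputing from scratch
assert np.isclose(grad, sigmoid(0.5) * (1 - sigmoid(0.5)))
```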

# Hyperbolic Tangent

Hyperbolic tangent is quite similar to the sigmoid function, except it maps the input to the range $(-1, 1)$:

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

The derivative is computed in the following way:

$$\tanh'(x) = 1 - \tanh^2(x)$$

Implement this in Python:

```python
def hyperb(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def der_hyperb(x):
    return 1 - hyperb(x)**2
```

And plot it:

```python
plt.figure(figsize=(12, 8))
plt.plot(x, [hyperb(v) for v in x], label="hyperbolic")
plt.plot(x, [der_hyperb(v) for v in x], label="derivative")
plt.title("Hyperbolic")
plt.legend()
plt.show()
```

**Advantage:** Efficient in middle layers, since its output lies between -1 and 1 and is centered around 0.

**Disadvantage:** Suffers from the vanishing gradient problem as well.
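One practical caveat with the manual implementation above: `np.exp(x)` overflows to infinity for $x$ above roughly 709 in float64, turning the ratio into `inf / inf`, i.e. `nan`. The built-in `np.tanh` is numerically stable:

```python
import numpy as np

def hyperb(x):
    # Naive formula: np.exp overflows for large x
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

print(np.tanh(1000.0))           # 1.0, no overflow
print(np.isnan(hyperb(1000.0)))  # True: inf / inf is nan
```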

# Arctan

This activation function maps the input values to the range $(-\pi/2, \pi/2)$. It is the inverse of the tangent function:

$$f(x) = \arctan(x)$$

And the derivative:

$$f'(x) = \frac{1}{1 + x^2}$$

In Python:

```python
def arctan(x):
    return np.arctan(x)

def der_arctan(x):
    return 1 / (1 + x**2)
```

And plot it:

```python
plt.figure(figsize=(12, 8))
plt.plot(x, [arctan(v) for v in x], label="arctan")
plt.plot(x, [der_arctan(v) for v in x], label="derivative")
plt.title("Arctan")
plt.legend()
plt.show()
```