Softmax function — It is frustrating that everyone talks about it but very few talk about its Jacobian

And it is so similar to the Sigmoid derivative but not exactly the same

neuralthreads · 4 min read · Dec 1, 2021

This is the most important post in the third chapter. In this post, we will talk about the Softmax activation function and how to compute its Jacobian.

You can download the Jupyter Notebook from here.

Back to the previous post

Back to the first post

3.7 What is the Softmax function and how to compute its Jacobian?

This is the definition of the Softmax function.

Suppose we have ‘x’

Then ‘y’ is

We can easily define the Softmax function in Python using this NumPy reduction:

import numpy as np                             # importing NumPy
np.random.seed(42)

def softmax(x):                                # Softmax
    return np.exp(x) / np.sum(np.exp(x))
Defining Softmax function

Now, the most important question: how do we compute its derivative/Jacobian?

Unlike the Sigmoid activation function or any other previous activation function, each output here is not a function of its own input alone.

Instead, every output of the Softmax depends on every input, because the denominator sums over all the elements of ‘x’.
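Written with the notation of the definition above, this contrast is:

\frac{\partial y_i}{\partial x_j} = 0 \ \text{for } i \neq j \qquad \text{(element-wise activations such as Sigmoid)}

\frac{\partial y_i}{\partial x_j} \neq 0 \ \text{in general} \qquad \text{(Softmax, because of the shared denominator)}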

In such a case, we use something called a Jacobian.

A Jacobian, in very simple language, is a collection of partial derivatives: one entry for the derivative of every output with respect to every input.

So, the Jacobian ‘J’ for the Softmax function is:
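(The matrix image is not included here; reconstructed in LaTeX, it is:)

J =
\begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial y_n}{\partial x_1} & \frac{\partial y_n}{\partial x_2} & \cdots & \frac{\partial y_n}{\partial x_n}
\end{pmatrix}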

Let us start finding each term in this Jacobian.

Starting with the first term, i.e., the diagonal entry ∂y1/∂x1.
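Applying the quotient rule to y_1 = e^{x_1} / \sum_k e^{x_k} (a worked version of the step whose image is omitted):

\frac{\partial y_1}{\partial x_1}
= \frac{e^{x_1} \sum_k e^{x_k} - e^{x_1} \cdot e^{x_1}}{\left( \sum_k e^{x_k} \right)^2}
= \frac{e^{x_1}}{\sum_k e^{x_k}} \left( 1 - \frac{e^{x_1}}{\sum_k e^{x_k}} \right)
= y_1 (1 - y_1)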

Similarly, we can calculate every other term.
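For an off-diagonal entry such as ∂y1/∂x2, the numerator e^{x_1} does not depend on x_2, so the quotient rule gives:

\frac{\partial y_1}{\partial x_2}
= \frac{0 \cdot \sum_k e^{x_k} - e^{x_1} \cdot e^{x_2}}{\left( \sum_k e^{x_k} \right)^2}
= - y_1 y_2

In general,

\frac{\partial y_i}{\partial x_j} = y_i (1 - y_i) \ \text{if } i = j,
\qquad
\frac{\partial y_i}{\partial x_j} = - y_i y_j \ \text{if } i \neq j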

After finding every term in the Jacobian, we have a symmetric matrix
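(Written out in LaTeX, since the matrix image is not reproduced:)

J =
\begin{pmatrix}
y_1 (1 - y_1) & - y_1 y_2 & \cdots & - y_1 y_n \\
- y_2 y_1 & y_2 (1 - y_2) & \cdots & - y_2 y_n \\
\vdots & \vdots & \ddots & \vdots \\
- y_n y_1 & - y_n y_2 & \cdots & y_n (1 - y_n)
\end{pmatrix}
= \operatorname{diag}(y) - y y^T

The compact form on the right is what the Python definition below computes.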

We can reduce it to define the Softmax Jacobian in Python like this.

You must have noticed that it is very similar to the derivative of the Sigmoid function but not exactly the same.

def softmax_dash(x):                           # Softmax Jacobian
    I = np.eye(x.shape[0])                     # identity matrix, same size as x
    return softmax(x) * (I - softmax(x).T)
Defining Softmax Jacobian
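Note that softmax(x) is an (n, 1) column and softmax(x).T is a (1, n) row, so NumPy broadcasting turns softmax(x) * (I - softmax(x).T) into a matrix whose (i, j) entry is y_i (δ_ij − y_j), exactly the terms derived above.

As a quick sanity check (this snippet is an addition, not part of the original notebook), we can compare the analytic Jacobian with a central finite-difference approximation:

eps = 1e-6                                     # finite-difference step size
x = np.array([[0.25], [-1], [2.3], [-0.2], [1]])
J_analytic = softmax_dash(x)                   # analytic Jacobian
J_numeric = np.zeros_like(J_analytic)          # finite-difference Jacobian
for j in range(x.shape[0]):                    # perturb one input at a time
    x_plus, x_minus = x.copy(), x.copy()
    x_plus[j] += eps
    x_minus[j] -= eps
    J_numeric[:, j] = ((softmax(x_plus) - softmax(x_minus)) / (2 * eps)).ravel()
print(np.allclose(J_analytic, J_numeric))      # should print True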

Let us have a look at an example

x = np.array([[0.25], [-1], [2.3], [-0.2], [1]])
x

softmax(x)
np.sum(softmax(x))
softmax_dash(x)
softmax_dash(x) == softmax_dash(x).T
Softmax example

You must have noticed that the sum of the scalars in the softmax output ‘y’ is equal to 1.
Why? Because each entry is an exponential divided by the sum of all the exponentials, so the entries add up to 1 by construction.

y1, y2, y3, y4… can be treated as probabilities, or answers to a question, because their sum is equal to 1. We will see more on this later when we talk about the Categorical Cross-entropy loss function.

Softmax Jacobian

We can see that the Jacobian of the Softmax function is a symmetric matrix.

I hope that now you understand the Softmax function and its Jacobian.

With this, the third chapter is over. In the next post, we will start the Fourth Chapter — Losses and their derivatives with Mean Square Error.

Watch the video on YouTube and subscribe to the channel for videos and posts like this.
Every slide is 3 seconds long and without sound. You may pause the video whenever you like.
You may put on some music too if you like.

The video is basically everything in the post, only in slides.

Many thanks for your support and feedback.

If you like this course, then you can support me at

It would mean a lot to me.

Continue to the next post — 4.1 Mean Square Error and its derivative.
