Categorical cross-entropy loss — The most important loss function
Step-by-step implementation and its gradients, with an example
This is the most important post in the fourth chapter. Here we will talk about Categorical cross-entropy loss and what it means.
You can download the Jupyter Notebook from here.
4.3 What is Categorical cross-entropy loss and how to compute the gradients?
Suppose I ask you ‘Who is this actor?’
And I give you 3 options.
- Chris Evans
- RDJ
- Chris Hemsworth
We all know he is RDJ. So, the correct answer ‘y_true’ is the one-hot vector [0, 1, 0], where the 1 marks the second option, RDJ.
But you don’t know who he is, so you guess the answer with a probability for each option. This vector of probabilities is ‘y_pred’.
Two things,
First, in both answers, i.e. the correct one and the predicted one, the entries sum to 1, because they are probabilities answering the same question.
Second, we could be optimistic and simply take the entry with the highest probability. In that case, your predicted answer is correct, because the highest probability, 0.45, belongs to the second entry, i.e., RDJ. But there is still an error, because the predicted probability for RDJ is not 1 or even close to 1.
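Here is a minimal sketch of that check in NumPy. Only the 0.45 for RDJ is stated above; the other two probabilities in y_pred below are made up for illustration.

import numpy as np  # importing NumPy

y_true = np.array([0, 1, 0])            # one-hot: RDJ is the correct answer
y_pred = np.array([0.30, 0.45, 0.25])   # hypothetical guess; only the 0.45 comes from the text

print(y_true.sum(), y_pred.sum())               # both sum to 1
print(np.argmax(y_pred) == np.argmax(y_true))   # True: the highest probability picks RDJ
# ...but 0.45 is far from 1, so there is still an error to measure.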
To measure this error, we will use Categorical cross-entropy loss.
Suppose we have true values ‘y_true’ and predicted values ‘y_pred’. Then Categorical cross-entropy loss is calculated as follows:

CE = -∑ y_true_i * log(y_pred_i)

where the sum runs over all entries i (one per class) of the two vectors.
We can easily calculate Categorical cross-entropy loss in Python like this.
import numpy as np  # importing NumPy

np.random.seed(42)

def cross_E(y_true, y_pred):  # CE
    return -np.sum(y_true * np.log(y_pred + 10**-100))
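Notice that when y_true is one-hot, every term except the true-class one is multiplied by 0, so the loss reduces to the negative log of the probability assigned to the correct class; the tiny constant 10**-100 is only there to avoid log(0). A quick check, reusing the hypothetical guess from above (not part of the original notebook):

y_true_demo = np.array([0, 1, 0])
y_pred_demo = np.array([0.30, 0.45, 0.25])   # hypothetical probabilities

print(cross_E(y_true_demo, y_pred_demo))     # ≈ 0.7985
print(-np.log(0.45))                         # same value: -log of the RDJ probability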
Now, we know that the derivative of log(x) with respect to x is 1/x.
So, like MSE and MAE, we have a Jacobian for CE, i.e., the partial derivative of the loss with respect to each predicted value.
We can easily find each term in this Jacobian:

∂CE/∂y_pred_i = -y_true_i / y_pred_i
We can easily define the CE Jacobian in Python like this.
def cross_E_grad(y_true, y_pred):  # CE Jacobian
    return -y_true / (y_pred + 10**-100)
Let us have a look at an example.
y_true = np.array([[0], [1], [0], [0]])
y_true

y_pred = np.array([[0.05], [0.85], [0.10], [0]])
y_pred

cross_E(y_true, y_pred)

cross_E_grad(y_true, y_pred)
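The notebook outputs are not reproduced here, but we can work them out by hand: only the true-class entry contributes, so the loss is -log(0.85) ≈ 0.1625, and the Jacobian is zero everywhere except at the true class, where it equals -1/0.85 ≈ -1.1765. Also note that the last predicted entry is exactly 0, which is the case the small constant 10**-100 guards against.

# Worked out by hand (not copied from the notebook):
print(-np.log(0.85))   # ≈ 0.1625, the loss
print(-1 / 0.85)       # ≈ -1.1765, the only non-zero entry of the Jacobian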
I hope now you understand what Categorical cross-entropy loss is.
Note — In Chapter 5, we will talk more about the Softmax activation function and the Categorical cross-entropy loss function for Backpropagation, because the elements of the Softmax output sum to 1 and can be interpreted as probabilities, i.e., answers to a question.
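As a small preview, here is a minimal sketch of a Softmax function (an assumed implementation, not taken from this course's code) whose output can be fed straight into cross_E:

def softmax_demo(z):                     # assumed helper, not from the original notebook
    e = np.exp(z - np.max(z))            # subtract the max for numerical stability
    return e / np.sum(e)                 # entries are positive and sum to 1

z = np.array([1.0, 3.0, 0.5])            # hypothetical raw scores (logits)
p = softmax_demo(z)
print(p, p.sum())                        # a probability vector that sums to 1
print(cross_E(np.array([0, 1, 0]), p))   # CE applies directly to it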
Watch the video on YouTube and subscribe to the channel for videos and posts like this.
Every slide is 3 seconds long and without sound. You may pause the video whenever you like.
You may put on some music too if you like.
The video is basically everything in the post, just in slide form.
Many thanks for your support and feedback.
If you like this course, then you can support me at
It would mean a lot to me.
Continue to the next post — 4.4 Binary cross-entropy loss and its derivative.