Categorical cross-entropy loss — The most important loss function

Step by step implementation and its gradients with an example

neuralthreads
4 min read · Dec 2, 2021

This post is the most important post in the fourth chapter. Here we will talk about Categorical cross-entropy loss and what it means.

You can download the Jupyter Notebook from here.

Back to the previous post

Back to the first post

4.3 What is Categorical cross-entropy loss and how to compute the gradients?

Suppose I ask you ‘Who is this actor?’

And I give you 3 options.

  1. Chris Evans
  2. RDJ
  3. Chris Hemsworth

We all know he is RDJ. So, the correct answer ‘y_true’ is the one-hot vector [0, 1, 0]ᵀ, with all the probability on the second option.

But you don’t know who he is, so you guess the answer with probabilities ‘y_pred’, say 0.45 for RDJ and the remaining probability spread over the other two options.

Two things to note.

First, in both answers, i.e. the correct one and the predicted one, the entries sum to 1, because both are probabilities assigned to the same question.

Second, we could be optimistic and simply take the entry with the highest probability. In that case, your predicted answer is correct, because the highest probability of 0.45 belongs to the second entry, i.e., RDJ. But there is still an error, because the predicted probability for RDJ is not 1 or close to 1.
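To make this concrete, here is what the two answers could look like as NumPy arrays. Only the 0.45 for RDJ is stated above; the 0.30 and 0.25 for the other two options are assumed values for illustration.

import numpy as np

y_true = np.array([[0], [1], [0]])             # one-hot: the correct answer is RDJ
y_pred = np.array([[0.30], [0.45], [0.25]])    # guessed probabilities (0.30 and 0.25 are assumed)

y_true.sum(), y_pred.sum()                     # both sum to 1

y_true and y_pred for the actor question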

Here, we will use Categorical cross-entropy loss.

Suppose we have true values y_true = [y₁, y₂, …, yₙ]ᵀ,

and predicted values y_pred = [ŷ₁, ŷ₂, …, ŷₙ]ᵀ.

Then Categorical cross-entropy loss is calculated as follows:

CE = -(y₁·log(ŷ₁) + y₂·log(ŷ₂) + … + yₙ·log(ŷₙ)) = -Σᵢ yᵢ·log(ŷᵢ)
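For the actor question above, only the RDJ entry contributes to the sum, because the other entries of y_true are 0:

CE = -(0·log(0.30) + 1·log(0.45) + 0·log(0.25)) = -log(0.45) ≈ 0.799

(0.30 and 0.25 are the assumed probabilities from the sketch above; they do not affect the result because they are multiplied by 0.)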

We can easily calculate Categorical cross-entropy loss in Python like this.

import numpy as np                             # importing NumPy
np.random.seed(42)

def cross_E(y_true, y_pred):                   # CE
    # 10**-100 keeps log() away from 0 when a predicted probability is exactly 0
    return -np.sum(y_true * np.log(y_pred + 10**-100))

Defining Categorical cross-entropy loss

Now, we know that CE is a single number that depends on every entry of y_pred.

So, like MSE and MAE, we have a Jacobian for CE, i.e. the partial derivative of the loss with respect to each predicted value.

We can easily find each term in this Jacobian. Only the term containing ŷᵢ survives differentiation with respect to ŷᵢ, so

∂CE/∂ŷᵢ = -yᵢ/ŷᵢ

and the Jacobian is [-y₁/ŷ₁, -y₂/ŷ₂, …, -yₙ/ŷₙ].
We can easily define the CE Jacobian in Python like this.

def cross_E_grad(y_true, y_pred):              # CE Jacobian
    # the same tiny constant avoids division by 0
    return -y_true / (y_pred + 10**-100)

Defining CE Jacobian

Let us have a look at an example.

y_true = np.array([[0], [1], [0], [0]])
y_true

y_pred = np.array([[0.05], [0.85], [0.10], [0]])
y_pred

y_true and y_pred

cross_E(y_true, y_pred)

cross_E_grad(y_true, y_pred)

CE and the gradients
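As a quick sanity check, here is a small sketch that prints the loss and the Jacobian for this example and compares the analytic gradients with numerical finite differences; the two should agree closely.

print(cross_E(y_true, y_pred))                 # ≈ 0.1625, i.e. -log(0.85)
print(cross_E_grad(y_true, y_pred))            # ≈ [[0], [-1.1765], [0], [0]], i.e. -1/0.85 for the true-class entry

eps = 1e-6                                     # small step for finite differences
numerical = np.zeros_like(y_pred, dtype=float)
for i in range(len(y_pred)):
    shifted = y_pred.astype(float).copy()
    shifted[i] += eps                          # nudge one predicted probability
    numerical[i] = (cross_E(y_true, shifted) - cross_E(y_true, y_pred)) / eps

print(numerical)                               # close to the analytic Jacobian above

Checking the CE Jacobian with finite differences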

I hope you now understand what Categorical cross-entropy loss is.

Note — In Chapter 5, we will talk more about the Softmax activation function and Categorical cross-entropy loss function for Backpropagation. Because, in the output of the Softmax function, the sum of elements is equal to 1 and they can be interpreted as probabilities or answers to a question.
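As a small preview of that combination, here is a minimal sketch, assuming a simple softmax helper (defined properly in Chapter 5) and some made-up raw network outputs, showing that the softmax output sums to 1 and can be fed straight into cross_E.

def softmax(x):                                # numerically stable softmax (assumed helper)
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

logits = np.array([[1.3], [3.1], [0.2], [0.7]])    # made-up raw network outputs
probs = softmax(logits)

print(probs.sum())                             # 1.0, so the outputs are valid probabilities
print(cross_E(y_true, probs))                  # CE between the one-hot target and the softmax output

Softmax output as input to Categorical cross-entropy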

Watch the video on YouTube and subscribe to the channel for videos and posts like this.
Every slide is 3 seconds long and without sound. You may pause the video whenever you like.
You may put on some music too if you like.

The video is basically everything in the post only in slides.

Many thanks for your support and feedback.

If you like this course, then you can support me at

It would mean a lot to me.

Continue to the next post — 4.4 Binary cross-entropy loss and its derivative.
