Binary cross-entropy loss — Special case of Categorical cross-entropy loss

Step by step implementation and its gradients with an example

neuralthreads
5 min read · Dec 2, 2021

In this post, we will talk about Binary cross-entropy loss and what it means. We will also see how to compute its gradients, and how it is a special case of Categorical cross-entropy loss.

You can download the Jupyter Notebook from here.

Back to the previous post

Back to the first post

4.4 What is Binary cross-entropy loss and how to compute the gradients?

Let me ask you: which of the following statements are true for this image?

  1. He is RDJ.
  2. He played the role of Iron Man in the MCU.
  3. He also played Jack Sparrow.
  4. He is also Sherlock Holmes.

We know that the correct answers ‘y_true’ are

y_true = [1, 1, 0, 1]

(he is RDJ, he did play Iron Man in the MCU, he did not play Jack Sparrow, and he did play Sherlock Holmes).

But let us suppose you don’t know, and you guess your answers as probabilities ‘y_pred’, say,

y_pred = [0.9, 0.76, 0.6, 0.95]

Two things to note.

First, each element in y_true and y_pred is an independent answer, unlike the Categorical cross-entropy example, because here we have 4 separate questions.

Second, we can set a threshold value at 0.8, i.e., all values greater than or equal to 0.8 become 1 and the others become 0. In that case, you correctly answered 3 out of 4 questions. But what if the threshold value is 0.74? In that case, you correctly answered all 4 questions, yet there is still an error, because the predicted value for the third question is not 0 or close to 0. The snippet below makes this concrete.
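
We can check both thresholds in NumPy (the probability values here are the illustrative guesses from above):

import numpy as np

y_true = np.array([1, 1, 0, 1])            # correct answers to the 4 questions
y_pred = np.array([0.9, 0.76, 0.6, 0.95])  # illustrative guessed probabilities

print((y_pred >= 0.8).astype(int))         # [1 0 0 1] -> 3 of 4 match y_true
print((y_pred >= 0.74).astype(int))        # [1 1 0 1] -> all 4 match y_true

Thresholding the predictions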

Here, we will use Binary cross-entropy loss.

Suppose we have true values,

$$y_{true} = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}$$

and predicted values,

$$y_{pred} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \hat{y}_3 \end{bmatrix}$$

Then Binary cross-entropy loss is calculated as follows:

$$\text{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\,y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\,\Big]$$

We can easily calculate Binary cross-entropy loss in Python like this.

import numpy as np                           # importing NumPy
np.random.seed(42)

def B_cross_E(y_true, y_pred):               # BCE
    # 10**-100 keeps log() away from log(0)
    return -np.mean(y_true * np.log(y_pred + 10**-100)
                    + (1 - y_true) * np.log(1 - y_pred + 10**-100))

Defining BCE
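
A quick sanity check (with made-up vectors, not from the notebook): a perfect guess should give a loss of about 0, and a confident wrong guess a large one.

y = np.array([[1], [0]])
print(B_cross_E(y, np.array([[1.0], [0.0]])))    # perfect guess  -> ~0.0
print(B_cross_E(y, np.array([[0.01], [0.99]])))  # confident miss -> ~4.6

Sanity-checking BCE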

Now, we know that

$$\text{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\,y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\,\Big]$$

So, like MSE, MAE, and CE, we have a Jacobian for BCE.

$$\frac{\partial\,\text{BCE}}{\partial\, y_{pred}} = \begin{bmatrix} \dfrac{\partial\,\text{BCE}}{\partial \hat{y}_1} \\[6pt] \dfrac{\partial\,\text{BCE}}{\partial \hat{y}_2} \\[6pt] \dfrac{\partial\,\text{BCE}}{\partial \hat{y}_3} \end{bmatrix}$$

We can easily find each term in this Jacobian.

$$\frac{\partial\,\text{BCE}}{\partial \hat{y}_i} = -\frac{1}{N}\left(\frac{y_i}{\hat{y}_i} - \frac{1 - y_i}{1 - \hat{y}_i}\right)$$

Note — Here, 3 represents ‘N’, i.e., the number of entries in y_true and y_pred

We can easily define the BCE Jacobian in Python like this.

def B_cross_E_grad(y_true, y_pred):          # BCE Jacobian
    N = y_true.shape[0]
    # 10**-100 keeps the denominators away from 0
    return -(y_true / (y_pred + 10**-100)
             - (1 - y_true) / (1 - y_pred + 10**-100)) / N

Defining BCE Jacobian
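
If you want to verify the Jacobian, a finite-difference check is a handy sketch (the vectors below are made up for illustration): nudge one prediction and compare the numerical slope with the analytic gradient.

eps = 1e-6
y_t = np.array([[1.0], [0.0], [1.0]])
y_p = np.array([[0.3], [0.6], [0.9]])

i = 1                                        # perturb the second prediction
y_plus, y_minus = y_p.copy(), y_p.copy()
y_plus[i] += eps
y_minus[i] -= eps

numeric = (B_cross_E(y_t, y_plus) - B_cross_E(y_t, y_minus)) / (2 * eps)
analytic = B_cross_E_grad(y_t, y_p)[i, 0]
print(numeric, analytic)                     # both ~0.8333

Checking the Jacobian numerically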

Let us have a look at an example.

y_true = np.array([[1], [0], [1], [1]])
y_true
y_pred = np.array([[0.4], [0.5], [0.8], [0.2]])
y_pred

y_true and y_pred

B_cross_E(y_true, y_pred)        # ~0.8605
B_cross_E_grad(y_true, y_pred)   # ~[[-0.625], [0.5], [-0.3125], [-1.25]]

BCE and the gradients

I hope you now understand what Binary cross-entropy loss is.

Now, let us talk a bit about Categorical and Binary cross-entropy loss functions.

A simple ‘Yes/No’ question has an answer that is either ‘Yes’ or ‘No’, and the two probabilities always complement each other, i.e., P(‘Yes’) + P(‘No’) = 1.

If we had only 1 question in this Binary cross-entropy example, it would simply be a case of Categorical cross-entropy loss with two options, ‘Yes’ and ‘No’.

But we have ‘N’ questions, so we take the mean over all of them. The snippet below sketches this equivalence for a single question.
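
Here, cross_E is a minimal two-class categorical cross-entropy written locally so the snippet stands on its own; it may differ cosmetically from the version in the CE post.

def cross_E(y_true, y_pred):                 # categorical CE, local sketch
    return -np.sum(y_true * np.log(y_pred + 10**-100))

y, p = 1, 0.7                                # one 'Yes' question, guessed 0.7
bce = B_cross_E(np.array([[y]]), np.array([[p]]))
cce = cross_E(np.array([y, 1 - y]), np.array([p, 1 - p]))
print(bce, cce)                              # both ~0.3567

BCE as a two-option CE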

Note — In Chapter 5, we will talk more about the Sigmoid activation function and the Binary cross-entropy loss function for Backpropagation, because every element in the output of the Sigmoid function is independent, lies between 0 and 1, and can be interpreted as a probability, i.e., as the answer to one of ‘N’ questions.
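
As a tiny preview (the input numbers are illustrative), Sigmoid squashes each input into (0, 1) independently:

def sigmoid(x):                              # preview of Chapter 5
    return 1 / (1 + np.exp(-x))

z = np.array([[-2.0], [0.0], [3.0]])
print(sigmoid(z))                            # [[0.119], [0.5], [0.953]]

Sigmoid outputs as independent probabilities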

With this post, Chapter 4 — Losses and their derivatives is finished. With the next post we will be starting Chapter 5 — Diving deep in the Neural Networks, in which we will talk about how Artificial Neural Networks are built, followed by Backpropagation, L1 and L2 penalties, Dropout, Layer Normalization, Batch training and Validation sets, and finally a UCI white wine quality dataset example.

Watch the video on YouTube and subscribe to the channel for videos and posts like this.
Every slide is 3 seconds long and without sound. You may pause the video whenever you like.
You may put on some music too if you like.

The video is basically everything in the post only in slides.

Many thanks for your support and feedback.

If you like this course, then you can support me at

It would mean a lot to me.

Continue to the next post — 5.1 Forward feed in ANNs.
