Gradient Descent — a trick developed in 1847 that is now the first step in the field of Neural Networks
Step by step implementation with animation for better understanding.
2.1 What is SGD or Stochastic Gradient Descent?
When someone new enters the field of Deep Learning, Stochastic Gradient Descent is the first optimizer they encounter. It is one of the oldest techniques for minimizing a loss function. What we mean by a loss function, and how these optimizers are used in DL, will be discussed later. Right now, our focus is simply to understand the steps used in the SGD optimizer, and we will use an animation to see exactly what is happening.
You can download the Jupyter Notebook from here.
This post is divided into 3 sections
- SGD in 1 Variable function
- SGD Animation for 1 variable function
- SGD in multi-variable function
SGD in 1 variable function
Suppose we need to find the minima of this function:

f(x) = x − x³

We will use calculus for this. We will simply take the derivative of the function and equate it to 0:

f′(x) = 1 − 3x² = 0

Which gives us

x = ±1/√3

These points will either be minima, maxima, or points of inflection. For a minimum, the second derivative should be positive, and here f″(x) = −6x, which is positive only for negative x.

So, the minima is at

x = −1/√3 ≈ −0.577
Here is the graph of the function.
But how will a computer find the minima? It can't do what we just did.
This is where we use SGD, or Stochastic Gradient Descent, or simply Gradient Descent.
We will use the analogy of a ball sliding down the hill.
We will also slide down the function to reach the minima.
First, we will start at a random point on the function. We are starting from
x = -1
Then we use the fact that the derivative, or gradient, gives us the direction of steepest ascent.
So, the negative gradient gives us the direction of steepest descent.
The raw magnitude of the gradient is too large to use as a step directly. So, we will multiply it by a small scalar called the 'learning rate'. The learning rate is generally in the range of 0.01 to 0.001.
The product of the negative gradient and the learning rate will be our update, and we will add the update to the point.
So, the SGD algorithm in very simple language is as follows:
Step 1 - Set starting point and learning rate
Step 2 - Initiate loop
Step 2.1 - Calculate update = - learning_rate * gradient
Step 2.2 - add the update to point
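To make the steps concrete, here is a single update worked out by hand for our function, assuming the starting point x = −1 and a learning rate of 0.01:

```python
def fdash(x):                            # derivative of f(x) = x - x**3
    return 1 - 3*(x**2)

point = -1                               # step 1: starting point
learning_rate = 0.01                     # step 1: learning rate

update = - learning_rate * fdash(point)  # step 2.1: -0.01 * (-2) = 0.02
point += update                          # step 2.2: -1 + 0.02 = -0.98
print(point)                             # -0.98
```

The gradient at x = −1 is −2, so the update is +0.02 and the point moves a small step to the right, toward the minimum.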
Let us see how to implement SGD in Python.
First, we will define the function and its derivative.
import numpy as np
np.random.seed(42)

def f(x):                 # function definition
    return x - x**3

def fdash(x):             # function derivative definition
    return 1 - 3*(x**2)
And now it is time to write the steps of SGD
point = -1                                  # step 1
learning_rate = 0.01

for i in range(1000):                       # step 2
    update = - learning_rate * fdash(point) # step 2.1
    point += update                         # step 2.2

print(point)                                # Minima
See, it is very easy to implement SGD in Python.
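As a quick sanity check (a sketch that simply reruns the same loop), the result should match the analytic minimum −1/√3 ≈ −0.5774 that we found with calculus:

```python
import numpy as np

def fdash(x):                           # derivative of f(x) = x - x**3
    return 1 - 3*(x**2)

point = -1.0                            # step 1
learning_rate = 0.01

for i in range(1000):                   # step 2
    point += - learning_rate * fdash(point)

analytic = -1 / np.sqrt(3)              # minimum from calculus
print(point, analytic)                  # both approximately -0.57735
```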
SGD Animation for better understanding
An animation gives us a good feel for what is happening in SGD.
For this, we will create a list in Python that will store the starting point and all the update points in it. And, we will use the iᵗʰ index value from the list for iᵗʰ frame in the animation.
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib.animation import PillowWriter

point_sgd = [-1]                            # initiating list with starting point in it

point = -1                                  # step 1
learning_rate = 0.01

for i in range(1000):                       # step 2
    update = - learning_rate * fdash(point) # step 2.1
    point += update                         # step 2.2
    point_sgd.append(point)                 # adding updated point to the list

print(point)                                # Minima
Next, we will configure the graph for the animation. You can change these settings if you want something different.
plt.rcParams.update({'font.size': 22})

fig = plt.figure(dpi = 100)
fig.set_figheight(10.80)
fig.set_figwidth(19.20)

x_ = np.linspace(-5, 5, 10000)
y_ = f(x_)

ax = plt.axes()
ax.plot(x_, y_)
ax.grid(alpha = 0.5)
ax.set_xlim(-5, 5)
ax.set_ylim(-5, 5)
ax.set_xlabel('x')
ax.set_ylabel('y', rotation = 0)
ax.scatter(-1, f(-1), color = 'red')
ax.hlines(f(-0.5773502691896256), -5, 5, linestyles = 'dashed', alpha = 0.5)
ax.set_title('SGD, learning_rate = 0.01')
Now we will animate the SGD optimizer.
def animate(i):
    ax.clear()
    ax.plot(x_, y_)
    ax.grid(alpha = 0.5)
    ax.set_xlim(-5, 5)
    ax.set_ylim(-5, 5)
    ax.set_xlabel('x')
    ax.set_ylabel('y', rotation = 0)
    ax.hlines(f(-0.5773502691896256), -5, 5, linestyles = 'dashed', alpha = 0.5)
    ax.set_title('SGD, learning_rate = 0.01')
    ax.scatter(point_sgd[i], f(point_sgd[i]), color = 'red')
The last line in the code snippet above is using the iᵗʰ index value from the list for iᵗʰ frame in the animation.
anim = animation.FuncAnimation(fig, animate, frames = 200, interval = 20)
anim.save('2.1 SGD.gif')
We are creating an animation with only 200 frames; the GIF plays at 50 fps, i.e., a frame interval of 20 ms.
It is worth noting that we reach the minima in fewer than 200 iterations.
SGD in multi-variable function (2 variables right now)
The trick is the same and every step is the same, but for each variable we will use the partial derivative of the function w.r.t. that variable to calculate its update.
But first, let us define the function and its partial derivatives.
We are using this function:

f(x, y) = 2x² + 2xy + 2y² − 6x

We know that the minima for this function is at (2, −1), and we will start from (1, 0).
The partial derivatives are

∂f/∂x = 4x + 2y − 6
∂f/∂y = 2x + 4y
def f(x, y):                # function definition
    return 2*(x**2) + 2*x*y + 2*(y**2) - 6*x

def fdash_x(x, y):          # partial derivative w.r.t. x
    return 4*x + 2*y - 6

def fdash_y(x, y):          # partial derivative w.r.t. y
    return 2*x + 4*y
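As a quick check on these definitions, both partial derivatives should vanish at the known minimum (2, −1):

```python
def fdash_x(x, y):          # partial derivative w.r.t. x
    return 4*x + 2*y - 6

def fdash_y(x, y):          # partial derivative w.r.t. y
    return 2*x + 4*y

# At the minimum, the gradient is the zero vector
print(fdash_x(2, -1), fdash_y(2, -1))   # 0 0
```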
Now we will create an array for the gradient that stores the partial derivatives in the shape (-1, 1), and an array for the point that stores 'x' and 'y', also in the shape (-1, 1). In this case, -1 resolves to 2, so both arrays have shape (2, 1).
def gradient(point):
return np.array([[ fdash_x(point[0][0], point[1][0]) ],
[ fdash_y(point[0][0], point[1][0]) ]], dtype = np.float64) #gradients
You can see that the shape of gradient and point is (2, 1)
Now, we can calculate the update in matrix form:

update = − learning_rate × [[∂f/∂x], [∂f/∂y]]

And we can reduce it to a single line in Python:

update = - learning_rate * gradient(point)

And we can add the update to the point in matrix form:

point = point + update
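For example, one update starting from (1, 0) works out like this (a sketch assuming learning_rate = 0.01):

```python
import numpy as np

def gradient(point):
    # column array of the partial derivatives of f(x, y) = 2x² + 2xy + 2y² - 6x
    x, y = point[0][0], point[1][0]
    return np.array([[4*x + 2*y - 6],
                     [2*x + 4*y]], dtype = np.float64)

point = np.array([[1.0], [0.0]])            # starting point (1, 0), shape (2, 1)
learning_rate = 0.01

update = - learning_rate * gradient(point)  # -0.01 * [[-2], [2]] = [[0.02], [-0.02]]
point += update
print(point)                                # approximately [[1.02], [-0.02]]
```

Both variables move together in one vectorized step, which is the whole point of the matrix form.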
So, we will do exactly what we did earlier with 1 variable. All the steps are the same, but we will initialize the point at (1, 0) with shape (2, 1) and replace fdash(point) with gradient(point).
point = np.array([[ 1 ],                       # step 1
                  [ 0 ]], dtype = np.float64)
learning_rate = 0.01

for i in range(1000):                          # step 2
    update = - learning_rate * gradient(point) # step 2.1
    point += update                            # step 2.2

point                                          # Minima
You can see that the only difference is that we initialize the point (1, 0) with shape (2, 1), and because the gradient has shape (2, 1), the update also has shape (2, 1).
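Rerunning the loop as a sanity check, the point should land very close to the known minimum (2, −1):

```python
import numpy as np

def gradient(point):
    # column array of the partial derivatives of f(x, y) = 2x² + 2xy + 2y² - 6x
    x, y = point[0][0], point[1][0]
    return np.array([[4*x + 2*y - 6],
                     [2*x + 4*y]], dtype = np.float64)

point = np.array([[1.0], [0.0]])    # step 1
learning_rate = 0.01

for i in range(1000):               # step 2
    point += - learning_rate * gradient(point)

print(point)                        # approximately [[2.], [-1.]]
```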
If you are looking for the animation of SGD in 2 variables, you will have to wait until the last post of this chapter, 2.10 Optimizers Racing to the Minima, in which we will animate a race where all the optimizers (SGD, SGD with Momentum, SGD with Nesterov acceleration, Adagrad, RMSprop, Adadelta, Adam, Amsgrad, and Adamax) race to reach the minima.
I hope now you understand SGD. If you are looking for mini-batch or batch SGD, then you have to wait till Chapter 5, i.e., 5.6 Batch Training.
Now as a bonus, let us talk about a few things which can go wrong with Gradient Descent.
1. Overshooting
Suppose the update happens to be large for some reason, in this case because of the large gradient at the starting point. We cross the maxima and then keep moving, all the way to infinity, resulting in an overflow error.
This is called 'overshooting'. It can be avoided by proper selection of the starting point and learning rate.
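A minimal sketch of overshooting with the same f(x) = x − x³: starting far out at x = 2 (an illustrative choice, not the post's figure), the point is already past the maximum, the gradient is large, and each update grows the step further, so the point runs away toward infinity:

```python
def fdash(x):                   # derivative of f(x) = x - x**3
    return 1 - 3*(x**2)

point = 2.0                     # illustrative far-out starting point
learning_rate = 0.01

for i in range(22):
    point += - learning_rate * fdash(point)

print(point)                    # already in the thousands and exploding
```

A few more iterations would overflow a Python float entirely.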
2. Oscillation
We can see here that, due to symmetry, we are stuck in a loop where we jump from one point to another and back.
This is highly unlikely to occur, and it can be avoided by proper selection of the starting point and learning rate.
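The post's figure cannot be reproduced here, but a minimal sketch of the same looping behaviour uses f(x) = x² with learning_rate = 1 (a deliberately bad, illustrative choice): the update −1 × 2x exactly flips the sign of x, so the point jumps between x and −x forever:

```python
def fdash(x):                   # derivative of f(x) = x**2
    return 2*x

point = 1.0
learning_rate = 1.0             # deliberately bad learning rate

history = []
for i in range(6):
    point += - learning_rate * fdash(point)   # x -> x - 2x = -x
    history.append(point)

print(history)                  # [-1.0, 1.0, -1.0, 1.0, -1.0, 1.0]
```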
3. Divergence
Here, instead of converging, we are actually diverging.
This happens a lot in Deep Learning, where the loss gets bigger and bigger during training. It can be avoided by proper selection of the starting point and learning rate.
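A minimal sketch of divergence, again with f(x) = x² but learning_rate = 1.5 (illustrative): each update maps x to −2x, so the magnitude doubles every step instead of shrinking:

```python
def fdash(x):                   # derivative of f(x) = x**2
    return 2*x

point = 1.0
learning_rate = 1.5             # too large: steps overshoot and grow

for i in range(10):
    point += - learning_rate * fdash(point)   # x -> x - 3x = -2x

print(point)                    # 1024.0, i.e. (-2)**10
```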
4. Different initial point leads to different Minima
If we start from x = 0.60, we will end at x = 0, which is a local minimum.
But if we start from x = -0.61, we will end at x = -1.604, which is the global minimum.
5. Gradients are very, very small
Suppose our derivative involves a very high power of x. In that case, near the minimum (here, 0) the gradient is so small that even 100 million iterations will not lead us to the minima.
It is highly unlikely that we will have to deal with such small gradients, but if we do, we can either start from a point closer to the minima or increase the learning rate.
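A minimal sketch of this vanishing-gradient problem, using f(x) = x⁸ as an illustrative high-power choice (minimum at 0): near 0 the derivative 8x⁷ is so tiny that even a hundred thousand iterations barely move the point:

```python
def fdash(x):                   # derivative of f(x) = x**8
    return 8 * x**7

point = 0.5
learning_rate = 0.01

for i in range(100_000):
    point += - learning_rate * fdash(point)

print(point)                    # still well above 0.1, far from the minimum at 0
```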
Most of the time these things will not be a problem to us but the problem of loss divergence is very common in Deep Learning.
Watch the video on YouTube and subscribe to the channel for more videos and posts like this.
Every slide is 3 seconds long and there is no sound, so you may pause the video whenever you like.
You may also put on some music if you like.
The video is basically everything in this post, presented as slides.
Many thanks for your support and feedback.
If you like this course, then you can support me at
It would mean a lot to me.