Optimizers racing to the Minima

Step-by-step implementation with animation for better understanding

neuralthreads · Nov 30, 2021 · 10 min read

Back to the previous post

Back to the first post

2.10 Optimizers racing to the Minima

This will be the final post in Chapter 2 — Optimizers.

You can download the Jupyter Notebook from here.

Note — It is recommended that you have a look at the previous posts in which we talked about SGD, SGD with Momentum, SGD with Nesterov acceleration, Adagrad, RMSprop, Adadelta, Adam, Amsgrad, and Adamax.

Here we will create an animation in which all 9 Optimizers that we have seen in the previous posts race each other to reach the Minima. But before going further, a few things first.

First, this post is neither a true nor a pseudo comparison of Optimizers. I strongly recommend that you go through the literature before deciding which Optimizer is best. Many things were skipped in the previous posts; one of them is the mathematical motivation behind the optimizers.

Second, epsilon in optimizers is used for stability, and by stability I mean this:

Suppose we start from a point that is already the minimum; in that case, using epsilon prevents a division-by-zero error.

But in the case of Adadelta, epsilon is also used to kick-start the Optimizer. So, we will use epsilon = 10**-6 for Adadelta and 10**-8 for all the other optimizers.
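To make this concrete, here is a minimal sketch (not from the original notebook) of what happens at the minimum (2, -1), where the gradient and an Adagrad-style accumulator are both zero:

import numpy as np

grad = np.array([0.0, 0.0])            # gradient at the minimum is zero
accumulator = np.array([0.0, 0.0])     # no squared gradients accumulated yet
epsilon = 10**-8

# Without epsilon, 0.0 / 0.0 gives nan (with a RuntimeWarning),
# so the update would be undefined:
# grad / accumulator**0.5

# With epsilon, 0 / (0 + 1e-8) is simply zero, so the point stays at the minimum:
update = - 0.01 * grad / (accumulator**0.5 + epsilon)
print(update)                          # zeros: the point does not move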

And we will use a learning rate of 0.15 for Adagrad, 1 for Adadelta, and 0.01 for all the other optimizers.

Third, this post gives you an idea of how to make such videos. You could also plot in 3-D to see how we slide down to the minimum, or put a color contour in the background to show the function values.
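For example, a color contour of the function values can be drawn behind the trajectories. Here is a minimal sketch, assuming the imports and the f(x, y) defined below are already available; it is not part of the original notebook:

# A sketch (not from the original notebook) of a color contour background
# showing the function values, with the known minimum marked at (2, -1).
x = np.linspace(-5, 5, 400)
y = np.linspace(-5, 5, 400)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)                                        # f(x, y) as defined below

fig_c, ax_c = plt.subplots(figsize = (19.20, 10.80))
contour = ax_c.contourf(X, Y, Z, levels = 50, cmap = 'viridis', alpha = 0.5)
fig_c.colorbar(contour, ax = ax_c)                 # color bar for function values
ax_c.plot(2, -1, 'r*', markersize = 15)            # the minimum at (2, -1)
plt.show()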

All the steps are similar to what we did in the previous 9 posts. We will store the starting point and the updated points in a list, but this time we will use all the values up to the iᵗʰ index for the iᵗʰ frame of our animation. We will start from the point (3, 4), and the animation will have 1500 frames at 50 fps.

Note — You will notice that some optimizers do not exactly reach the minimum, but that is okay for us; we will focus on their trajectories. You may extend the animation to 2000 frames or even more.

And now, the animation for the race.

import numpy as np 
np.random.seed(42)
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib.animation import PillowWriter
Importing libraries

We will use the same two-variable function that we used in the previous 9 posts:

f(x, y) = 2x² + 2xy + 2y² − 6x

We know that the minimum of this function is at (2, −1), and we will start from (3, 4).

The partial derivatives are:

∂f/∂x = 4x + 2y − 6
∂f/∂y = 2x + 4y

def f(x, y):                                          # function definition
    return 2*(x**2) + 2*x*y + 2*(y**2) - 6*x

def fdash_x(x, y):                                    # partial derivative w.r.t. x
    return 4*x + 2*y - 6

def fdash_y(x, y):                                    # partial derivative w.r.t. y
    return 2*x + 4*y

def gradient(point):                                  # gradient as a 2x1 array
    return np.array([[ fdash_x(point[0][0], point[1][0]) ],
                     [ fdash_y(point[0][0], point[1][0]) ]], dtype = np.float64)
Defining the function, its partial derivatives, and gradient array
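As a quick sanity check (my addition, not in the original post), the gradient should vanish at the known minimum (2, −1):

# The gradient is zero at the minimum (2, -1), as expected
minimum = np.array([[  2 ],
                    [ -1 ]], dtype = np.float64)
print(gradient(minimum))     # [[0.]
                             #  [0.]]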

Now we will store the updated points in lists that initially contain only the starting point.

SGD

point_sgd = [np.array([[ 3 ],
                       [ 4 ]], dtype = np.float64)]            # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 0.01

for i in range(1500):
    update = - learning_rate * gradient(point)
    point += update

    point_sgd.append(point.copy())                             # adding updated points to the list

point
SGD

SGD with Momentum

point_sgd_momentum = [np.array([[ 3 ],
                                [ 4 ]], dtype = np.float64)]   # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 0.01
momentum = 0.9

update = np.array([[ 0 ],
                   [ 0 ]], dtype = np.float64)

for i in range(1500):
    update = - learning_rate * gradient(point) + momentum * update
    point += update

    point_sgd_momentum.append(point.copy())                    # adding updated points to the list

point
SGD with Momentum

SGD with Nesterov acceleration

point_sgd_nesterov = [np.array([[ 3 ],
                                [ 4 ]], dtype = np.float64)]   # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 0.01
momentum = 0.9

update = np.array([[ 0 ],
                   [ 0 ]], dtype = np.float64)

for i in range(1500):
    update = - learning_rate * gradient(point) + momentum * update
    update_ = - learning_rate * gradient(point) + momentum * update
    point += update_

    point_sgd_nesterov.append(point.copy())                    # adding updated points to the list

point
SGD with Nesterov acceleration

Adagrad

point_adagrad = [np.array([[ 3 ],
                           [ 4 ]], dtype = np.float64)]        # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 0.15
accumulator = np.array([[ 0.1 ],
                        [ 0.1 ]], dtype = np.float64)
epsilon = 10**-8

for i in range(1500):
    accumulator += gradient(point)**2
    update = - learning_rate * gradient(point) / (accumulator**0.5 + epsilon)
    point += update

    point_adagrad.append(point.copy())                         # adding updated points to the list

point
Adagrad

RMSprop

point_rmsprop = [np.array([[ 3 ],
                           [ 4 ]], dtype = np.float64)]        # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 0.01
rho = 0.9

accumulator = np.array([[ 0 ],
                        [ 0 ]], dtype = np.float32)
epsilon = 10**-8

for i in range(1500):
    accumulator = rho * accumulator + (1 - rho) * gradient(point)**2
    update = - learning_rate * gradient(point) / (accumulator**0.5 + epsilon)
    point += update

    point_rmsprop.append(point.copy())                         # adding updated points to the list

point
RMSprop

Adadelta

point_adadelta = [np.array([[ 3 ],
                            [ 4 ]], dtype = np.float64)]       # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 1
rho = 0.95
epsilon = 10**-6

accumulator_gradient = np.array([[ 0 ],
                                 [ 0 ]], dtype = np.float64)
accumulator_update = np.array([[ 0 ],
                               [ 0 ]], dtype = np.float64)

for i in range(1500):
    accumulator_gradient = rho * accumulator_gradient + (1 - rho) * gradient(point)**2
    update = - gradient(point) * (accumulator_update + epsilon)**0.5 / (accumulator_gradient + epsilon)**0.5
    accumulator_update = rho * accumulator_update + (1 - rho) * update**2
    point += learning_rate * update

    point_adadelta.append(point.copy())                        # adding updated points to the list

point
Adadelta

Adam

point_adam = [np.array([[ 3 ],
                        [ 4 ]], dtype = np.float64)]           # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 0.01
beta1 = 0.9
beta2 = 0.999
epsilon = 10**-8

moment1 = np.array([[ 0 ],
                    [ 0 ]], dtype = np.float64)
moment2 = np.array([[ 0 ],
                    [ 0 ]], dtype = np.float64)

for i in range(1500):
    learning_rate_hat = learning_rate * np.sqrt(1 - beta2**(i + 1)) / (1 - beta1**(i + 1))
    moment1 = beta1 * moment1 + (1 - beta1) * gradient(point)
    moment2 = beta2 * moment2 + (1 - beta2) * gradient(point)**2
    update = - learning_rate_hat * moment1 / (moment2**0.5 + epsilon)
    point += update

    point_adam.append(point.copy())                            # adding updated points to the list

point
Adam

Amsgrad

point_amsgrad = [np.array([[ 3 ],
                           [ 4 ]], dtype = np.float64)]        # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 0.01
beta1 = 0.9
beta2 = 0.999
epsilon = 10**-8

moment1 = np.array([[ 0 ],
                    [ 0 ]], dtype = np.float64)
moment2 = np.array([[ 0 ],
                    [ 0 ]], dtype = np.float64)
moment2_hat = np.array([[ 0 ],
                        [ 0 ]], dtype = np.float64)

for i in range(1500):
    learning_rate_hat = learning_rate * np.sqrt(1 - beta2**(i + 1)) / (1 - beta1**(i + 1))
    moment1 = beta1 * moment1 + (1 - beta1) * gradient(point)
    moment2 = beta2 * moment2 + (1 - beta2) * gradient(point)**2
    moment2_hat = np.maximum(moment2_hat, moment2)
    update = - learning_rate_hat * moment1 / (moment2_hat**0.5 + epsilon)
    point += update

    point_amsgrad.append(point.copy())                         # adding updated points to the list

point
Amsgrad

Adamax

point_adamax = [np.array([[ 3 ],
                          [ 4 ]], dtype = np.float64)]         # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 0.01
beta1 = 0.9
beta2 = 0.999
epsilon = 10**-8

moment = np.array([[ 0 ],
                   [ 0 ]], dtype = np.float64)
weight = np.array([[ 0 ],
                   [ 0 ]], dtype = np.float64)

for i in range(1500):
    learning_rate_hat = learning_rate / (1 - beta1**(i + 1))
    moment = beta1 * moment + (1 - beta1) * gradient(point)
    weight = np.maximum(beta2 * weight, abs(gradient(point)))
    update = - learning_rate_hat * moment / (weight + epsilon)
    point += update

    point_adamax.append(point.copy())                          # adding updated points to the list

point
Adamax

Animation for the race

We will configure a few settings for the graph used in the animation. You can change them if you want something different.

plt.rcParams.update({'font.size': 22})

fig = plt.figure(dpi = 100)
fig.set_figheight(10.80)
fig.set_figwidth(19.20)

ax = plt.axes()
ax.grid(alpha = 0.5)
ax.set_xlim(-5, 5)
ax.set_ylim(-5, 5)
ax.set_xlabel('x')
ax.set_ylabel('y', rotation = 0)
ax.set_title('Optimizers racing to the Minima')

ax.legend(['SGD, lr = 0.01', 'SGD_Momentum, lr = 0.01', 'SGD_Nesterov, lr = 0.01',
           'Adagrad, lr = 0.15', 'RMSProp, lr = 0.01', 'Adadelta, lr = 1 & ep = 10**-6',
           'Adam, lr = 0.01', 'Amsgrad, lr = 0.01', 'Adamax, lr = 0.01'], loc = "upper left")
A few settings for our graph in the animation
The very first frame of the animation
ax.hlines(-1, -5, 5, linestyles = 'dashed', alpha = 0.5)
ax.vlines(3, -5, 5, linestyles = 'dashed', alpha = 0.5)

lines = []
for index in range(9):
    lobj = ax.plot([], [], lw = 2)[0]
    lines.append(lobj)

def init():
    for line in lines:
        line.set_data([], [])
    return lines
A few lines of code to draw the dashed reference lines, create the 9 line objects, and define init()
xdata_sgd, ydata_sgd = [], [] 
xdata_sgd_momentum, ydata_sgd_momentum = [], []
xdata_sgd_nesterov, ydata_sgd_nesterov = [], []
xdata_adagrad, ydata_adagrad = [], []
xdata_rmsprop, ydata_rmsprop = [], []
xdata_adadelta, ydata_adadelta = [], []
xdata_adam, ydata_adam = [], []
xdata_amsgrad, ydata_amsgrad = [], []
xdata_adamax, ydata_adamax = [], []
Lists to store the values up to the iᵗʰ index, which will be used for the iᵗʰ frame

Now we will animate the race.

def animate(i):

    xdata_sgd.append(point_sgd[i][0][0])
    ydata_sgd.append(point_sgd[i][1][0])

    xdata_sgd_momentum.append(point_sgd_momentum[i][0][0])
    ydata_sgd_momentum.append(point_sgd_momentum[i][1][0])

    xdata_sgd_nesterov.append(point_sgd_nesterov[i][0][0])
    ydata_sgd_nesterov.append(point_sgd_nesterov[i][1][0])

    xdata_adagrad.append(point_adagrad[i][0][0])
    ydata_adagrad.append(point_adagrad[i][1][0])

    xdata_rmsprop.append(point_rmsprop[i][0][0])
    ydata_rmsprop.append(point_rmsprop[i][1][0])

    xdata_adadelta.append(point_adadelta[i][0][0])
    ydata_adadelta.append(point_adadelta[i][1][0])

    xdata_adam.append(point_adam[i][0][0])
    ydata_adam.append(point_adam[i][1][0])

    xdata_amsgrad.append(point_amsgrad[i][0][0])
    ydata_amsgrad.append(point_amsgrad[i][1][0])

    xdata_adamax.append(point_adamax[i][0][0])
    ydata_adamax.append(point_adamax[i][1][0])

    xlist = [xdata_sgd,
             xdata_sgd_momentum,
             xdata_sgd_nesterov,
             xdata_adagrad,
             xdata_rmsprop,
             xdata_adadelta,
             xdata_adam,
             xdata_amsgrad,
             xdata_adamax]

    ylist = [ydata_sgd,
             ydata_sgd_momentum,
             ydata_sgd_nesterov,
             ydata_adagrad,
             ydata_rmsprop,
             ydata_adadelta,
             ydata_adam,
             ydata_amsgrad,
             ydata_adamax]

    for lnum, line in enumerate(lines):
        if lnum in [0, 1, 2, 3, 4, 5, 6, 7, 8]:

            # 0 for SGD
            # 1 for SGD with Momentum
            # 2 for SGD with Nesterov acceleration
            # 3 for Adagrad
            # 4 for RMSProp
            # 5 for Adadelta
            # 6 for Adam
            # 7 for Amsgrad
            # 8 for Adamax

            line.set_data(xlist[lnum], ylist[lnum])

    ax.legend(['SGD, lr = 0.01', 'SGD_Momentum, lr = 0.01', 'SGD_Nesterov, lr = 0.01',
               'Adagrad, lr = 0.15', 'RMSProp, lr = 0.01', 'Adadelta, lr = 1 & ep = 10**-6',
               'Adam, lr = 0.01', 'Amsgrad, lr = 0.01', 'Adamax, lr = 0.01'], loc = "upper left")

    return lines
Using all the values up to the iᵗʰ index for the iᵗʰ frame by storing them in another list; the line.set_data(...) call inside the loop draws those values for each frame
anim = animation.FuncAnimation(fig, animate, init_func = init, 
frames = 1500, interval = 20, blit = True)
anim.save('2.10 Optimizers racing to the Minima.gif')
Race Animation
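The PillowWriter imported at the top is used implicitly when saving a .gif. If you want to set the frame rate explicitly (by default anim.save derives it from interval = 20, i.e. 50 fps), you can pass a writer yourself; this is an optional variant, not in the original notebook:

# Optional: pass the writer explicitly to control the saved frame rate.
writer = PillowWriter(fps = 50)
anim.save('2.10 Optimizers racing to the Minima.gif', writer = writer)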

If you only want to compare 2 or 3 Optimizers, or you want to see the trajectory of a single Optimizer, then you only have to change the 'if lnum in [0, 1, 2, 3, 4, 5, 6, 7, 8]:' line inside animate().

Change the 'if lnum in [...]' condition to hide some Optimizers

If you only want to see Adadelta, then the condition is 'if lnum in [5]:',
or if you want SGD with Nesterov acceleration, RMSprop, and Adamax, then the condition is 'if lnum in [2, 4, 8]:', as shown below.
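For example, to draw only SGD with Nesterov acceleration, RMSprop, and Adamax, the loop inside animate() becomes:

    for lnum, line in enumerate(lines):
        if lnum in [2, 4, 8]:        # 2 for Nesterov, 4 for RMSprop, 8 for Adamax
            line.set_data(xlist[lnum], ylist[lnum])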

Note — You should clear the previous plot before running the animation again, or the new animation will be drawn on top of the previous plot.
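One simple way to do that (a suggestion, not from the original notebook) is to close the previous figure before setting the graph up again:

plt.close(fig)        # or plt.close('all') to close every open figure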

I know this animation program could have been smaller and more optimized, but we are here to learn, and the code should be readable and simple.

With this post, Chapter 2 — Optimizers is over. With the next post, we will start Chapter 3 — Activation functions and their derivatives. The most important post will be the last one, in which we will talk about the Softmax activation function and its Jacobian, which is super easy. I don’t know why people don’t talk about that.

Watch the video on YouTube and subscribe to the channel for videos and posts like this.
Every slide is 3 seconds long and has no sound, so you may pause the video whenever you like, or put on some music while you watch.

The video is basically everything in the post only in slides.

Many thanks for your support and feedback.

If you like this course, then you can support me at

It would mean a lot to me.

Continue to the next post — 3.1 Sigmoid Activation function and its derivative.
