Optimizers racing to the Minima
A step-by-step implementation, with an animation for better understanding
2.10 Optimizers racing to the Minima
This will be the final post in Chapter 2 — Optimizers.
You can download the Jupyter Notebook from here.
Note — It is recommended that you have a look at the previous posts in which we talked about SGD, SGD with Momentum, SGD with Nesterov acceleration, Adagrad, RMSprop, Adadelta, Adam, Amsgrad, and Adamax.
Here we will create an animation in which all 9 optimizers that we have seen in the previous posts race each other to reach the minima. But before going further, a few things first.
First, this post is neither a true nor a pseudo comparison of optimizers. I strongly recommend going through the literature before concluding which optimizer is the best. Many things were skipped in the previous posts, one of them being the mathematical motivation behind each optimizer.
Second, epsilon in these optimizers is used for stability, and by stability I mean this: suppose we start from a point that is already the minima. In that case, epsilon prevents a division-by-zero error.
But in the case of Adadelta, epsilon is also what kick-starts the optimizer. So we will use epsilon = 10**-6 for Adadelta and 10**-8 for all other optimizers.
And we will use a learning rate of 0.15 for Adagrad, 1 for Adadelta, and 0.01 for all other optimizers.
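To see why epsilon matters, here is a minimal standalone sketch (my addition, not part of the race code) of an Adagrad-style step taken exactly at the minima, where both the gradient and the accumulator are zero:

import numpy as np

grad = np.array([0.0, 0.0])            # gradient at the minima is zero
accumulator = grad**2                  # so the squared-gradient accumulator is zero too
epsilon = 10**-8

# without epsilon this would be 0 / 0 and produce nan values
update = - 0.01 * grad / (accumulator**0.5 + epsilon)
print(update)                          # [-0. -0.] instead of [nan nan]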
Third, this post gives you an idea of how to make such videos. You could also plot in 3-D to watch how we slide down to the minima, or put a colour contour in the background to get an idea of the function values (a minimal contour sketch is shown after the figure setup below).
All the steps are similar to what we did in the previous 9 videos. We will store the starting point and the updated points in a list, but this time we will use all values up to the iᵗʰ index for the iᵗʰ frame of our animation. We will start from the point (3, 4), and the animation will be 1500 frames long at 50 fps.
Note — You will notice that some optimizers do not exactly reach the minima, but that is okay for us; we will focus on their trajectories. You may animate up to 2000 frames or even more yourself.
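The frame logic is simply "draw everything up to the current index". As a rough sketch of that idea with one hypothetical trajectory (xs and ys are assumed lists of coordinates; the actual animate function below appends to running lists instead of slicing):

def draw_frame(i, line, xs, ys):
    # for the i-th frame, show all points up to and including index i
    line.set_data(xs[:i + 1], ys[:i + 1])
    return line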
And, now the animation for the race.
import numpy as np
np.random.seed(42)

import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib.animation import PillowWriter
We will use the same two-variable function which we used for the previous 9 posts,
f(x, y) = 2x² + 2xy + 2y² − 6x
We know that the minima for this function is at (2, -1), and we will start from (3, 4). The partial derivatives are
∂f/∂x = 4x + 2y − 6 and ∂f/∂y = 2x + 4y
def f(x, y):                                     # function definition
    return 2*(x**2) + 2*x*y + 2*(y**2) - 6*x

def fdash_x(x, y):                               # partial derivative w.r.t x
    return 4*x + 2*y - 6

def fdash_y(x, y):                               # partial derivative w.r.t y
    return 2*x + 4*y

def gradient(point):                             # gradient as a column vector
    return np.array([[ fdash_x(point[0][0], point[1][0]) ],
                     [ fdash_y(point[0][0], point[1][0]) ]], dtype = np.float64)
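As a quick sanity check (my addition, not part of the original notebook), the gradient should vanish at the minima (2, -1) and be clearly non-zero at our starting point (3, 4):

start = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)
minima = np.array([[ 2 ],
                   [ -1 ]], dtype = np.float64)

print(gradient(minima))    # [[0.] [0.]]  -> (2, -1) is a stationary point
print(gradient(start))     # [[14.] [22.]] -> plenty of slope at the start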
Now we will store the updated points in lists which initially contain only the starting point.
SGD
point_sgd = [np.array([[ 3 ],
                       [ 4 ]], dtype = np.float64)]      # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 0.01

for i in range(1500):
    update = - learning_rate * gradient(point)
    point += update
    point_sgd.append(point.copy())                       # adding updated points to the list

point
SGD with Momentum
point_sgd_momentum = [np.array([[ 3 ],
                                [ 4 ]], dtype = np.float64)]    # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 0.01
momentum = 0.9

update = np.array([[ 0 ],
                   [ 0 ]], dtype = np.float64)

for i in range(1500):
    update = - learning_rate * gradient(point) + momentum * update
    point += update
    point_sgd_momentum.append(point.copy())                     # adding updated points to the list

point
SGD with Nesterov acceleration
point_sgd_nesterov = [np.array([[ 3 ],
                                [ 4 ]], dtype = np.float64)]    # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 0.01
momentum = 0.9

update = np.array([[ 0 ],
                   [ 0 ]], dtype = np.float64)

for i in range(1500):
    update = - learning_rate * gradient(point) + momentum * update     # velocity
    update_ = - learning_rate * gradient(point) + momentum * update    # look-ahead step actually applied
    point += update_
    point_sgd_nesterov.append(point.copy())                            # adding updated points to the list

point
Adagrad
point_adagrad = [np.array([[ 3 ],
                           [ 4 ]], dtype = np.float64)]    # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 0.15

accumulator = np.array([[ 0.1 ],
                        [ 0.1 ]], dtype = np.float64)
epsilon = 10**-8

for i in range(1500):
    accumulator += gradient(point)**2
    update = - learning_rate * gradient(point) / (accumulator**0.5 + epsilon)
    point += update
    point_adagrad.append(point.copy())                     # adding updated points to the list

point
RMSprop
point_rmsprop = [np.array([[ 3 ],
                           [ 4 ]], dtype = np.float64)]    # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 0.01
rho = 0.9

accumulator = np.array([[ 0 ],
                        [ 0 ]], dtype = np.float32)
epsilon = 10**-8

for i in range(1500):
    accumulator = rho * accumulator + (1 - rho) * gradient(point)**2
    update = - learning_rate * gradient(point) / (accumulator**0.5 + epsilon)
    point += update
    point_rmsprop.append(point.copy())                     # adding updated points to the list

point
Adadelta
point_adadelta = [np.array([[ 3 ],
                            [ 4 ]], dtype = np.float64)]   # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 1
rho = 0.95
epsilon = 10**-6

accumulator_gradient = np.array([[ 0 ],
                                 [ 0 ]], dtype = np.float64)
accumulator_update = np.array([[ 0 ],
                               [ 0 ]], dtype = np.float64)

for i in range(1500):
    accumulator_gradient = rho * accumulator_gradient + (1 - rho) * gradient(point)**2
    update = - gradient(point) * (accumulator_update + epsilon)**0.5 / (accumulator_gradient + epsilon)**0.5
    accumulator_update = rho * accumulator_update + (1 - rho) * update**2
    point += learning_rate * update
    point_adadelta.append(point.copy())                    # adding updated points to the list

point
Adam
point_adam = [np.array([[ 3 ],
                        [ 4 ]], dtype = np.float64)]       # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 0.01
beta1 = 0.9
beta2 = 0.999
epsilon = 10**-8

moment1 = np.array([[ 0 ],
                    [ 0 ]], dtype = np.float64)
moment2 = np.array([[ 0 ],
                    [ 0 ]], dtype = np.float64)

for i in range(1500):
    learning_rate_hat = learning_rate * np.sqrt(1 - beta2**(i + 1)) / (1 - beta1**(i + 1))   # bias-corrected step size
    moment1 = beta1 * moment1 + (1 - beta1) * gradient(point)
    moment2 = beta2 * moment2 + (1 - beta2) * gradient(point)**2
    update = - learning_rate_hat * moment1 / (moment2**0.5 + epsilon)
    point += update
    point_adam.append(point.copy())                        # adding updated points to the list

point
Amsgrad
point_amsgrad = [np.array([[ 3 ],
                           [ 4 ]], dtype = np.float64)]    # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 0.01
beta1 = 0.9
beta2 = 0.999
epsilon = 10**-8

moment1 = np.array([[ 0 ],
                    [ 0 ]], dtype = np.float64)
moment2 = np.array([[ 0 ],
                    [ 0 ]], dtype = np.float64)
moment2_hat = np.array([[ 0 ],
                        [ 0 ]], dtype = np.float64)

for i in range(1500):
    learning_rate_hat = learning_rate * np.sqrt(1 - beta2**(i + 1)) / (1 - beta1**(i + 1))   # bias-corrected step size
    moment1 = beta1 * moment1 + (1 - beta1) * gradient(point)
    moment2 = beta2 * moment2 + (1 - beta2) * gradient(point)**2
    moment2_hat = np.maximum(moment2_hat, moment2)         # keep the running maximum of the second moment
    update = - learning_rate_hat * moment1 / (moment2_hat**0.5 + epsilon)
    point += update
    point_amsgrad.append(point.copy())                     # adding updated points to the list

point
Adamax
point_adamax = [np.array([[ 3 ],
                          [ 4 ]], dtype = np.float64)]     # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 0.01
beta1 = 0.9
beta2 = 0.999
epsilon = 10**-8

moment = np.array([[ 0 ],
                   [ 0 ]], dtype = np.float64)
weight = np.array([[ 0 ],
                   [ 0 ]], dtype = np.float64)

for i in range(1500):
    learning_rate_hat = learning_rate / (1 - beta1**(i + 1))            # bias-corrected step size
    moment = beta1 * moment + (1 - beta1) * gradient(point)
    weight = np.maximum(beta2 * weight, abs(gradient(point)))           # exponentially weighted infinity norm
    update = - learning_rate_hat * moment / (weight + epsilon)
    point += update
    point_adamax.append(point.copy())                      # adding updated points to the list

point
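Before animating, you can check how close each optimizer actually got to the minima (2, -1) after 1500 steps. This little report is my own addition for inspection, not part of the original notebook:

trajectories = {'SGD': point_sgd,
                'SGD with Momentum': point_sgd_momentum,
                'SGD with Nesterov': point_sgd_nesterov,
                'Adagrad': point_adagrad,
                'RMSprop': point_rmsprop,
                'Adadelta': point_adadelta,
                'Adam': point_adam,
                'Amsgrad': point_amsgrad,
                'Adamax': point_adamax}

minima = np.array([[ 2 ],
                   [ -1 ]], dtype = np.float64)

for name, points in trajectories.items():
    distance = np.linalg.norm(points[-1] - minima)     # distance of the final point from the minima
    print(f'{name:20s} distance from minima after 1500 steps = {distance:.6f}')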
Animation for the race
We will now set up the figure for the animation. You can change these settings if you want something different.
plt.rcParams.update({'font.size': 22})

fig = plt.figure(dpi = 100)
fig.set_figheight(10.80)
fig.set_figwidth(19.20)

ax = plt.axes()
ax.grid(alpha = 0.5)
ax.set_xlim(-5, 5)
ax.set_ylim(-5, 5)
ax.set_xlabel('x')
ax.set_ylabel('y', rotation = 0)
ax.set_title('Optimizers racing to the Minima')

ax.hlines(-1, -5, 5, linestyles = 'dashed', alpha = 0.5)    # dashed reference lines
ax.vlines(3, -5, 5, linestyles = 'dashed', alpha = 0.5)

lines = []                                                  # one line object per optimizer
for index in range(9):
    lobj = ax.plot([], [], lw = 2)[0]
    lines.append(lobj)

ax.legend(['SGD, lr = 0.01', 'SGD_Momentum, lr = 0.01', 'SGD_Nesterov, lr = 0.01',
           'Adagrad, lr = 0.15', 'RMSProp, lr = 0.01', 'Adadelta, lr = 1 & ep = 10**-6',
           'Adam, lr = 0.01', 'Amsgrad, lr = 0.01', 'Adamax, lr = 0.01'], loc = "upper left")

def init():
    for line in lines:
        line.set_data([], [])
    return lines
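Earlier I mentioned that you can put a colour contour in the background to get an idea of the function values. Here is a minimal sketch of that idea (my addition, assuming the fig, ax and f defined above); it simply shades the plotting window before the lines are animated:

# optional colour contour of f over the plotting window
x_grid, y_grid = np.meshgrid(np.linspace(-5, 5, 200), np.linspace(-5, 5, 200))
contour = ax.contourf(x_grid, y_grid, f(x_grid, y_grid), levels = 30, cmap = 'viridis', alpha = 0.4)
fig.colorbar(contour, ax = ax)         # scale bar for the function values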
xdata_sgd, ydata_sgd = [], []
xdata_sgd_momentum, ydata_sgd_momentum = [], []
xdata_sgd_nesterov, ydata_sgd_nesterov = [], []
xdata_adagrad, ydata_adagrad = [], []
xdata_rmsprop, ydata_rmsprop = [], []
xdata_adadelta, ydata_adadelta = [], []
xdata_adam, ydata_adam = [], []
xdata_amsgrad, ydata_amsgrad = [], []
xdata_adamax, ydata_adamax = [], []
Now we will animate the race.
def animate(i):
    xdata_sgd.append(point_sgd[i][0][0])
    ydata_sgd.append(point_sgd[i][1][0])

    xdata_sgd_momentum.append(point_sgd_momentum[i][0][0])
    ydata_sgd_momentum.append(point_sgd_momentum[i][1][0])

    xdata_sgd_nesterov.append(point_sgd_nesterov[i][0][0])
    ydata_sgd_nesterov.append(point_sgd_nesterov[i][1][0])

    xdata_adagrad.append(point_adagrad[i][0][0])
    ydata_adagrad.append(point_adagrad[i][1][0])

    xdata_rmsprop.append(point_rmsprop[i][0][0])
    ydata_rmsprop.append(point_rmsprop[i][1][0])

    xdata_adadelta.append(point_adadelta[i][0][0])
    ydata_adadelta.append(point_adadelta[i][1][0])

    xdata_adam.append(point_adam[i][0][0])
    ydata_adam.append(point_adam[i][1][0])

    xdata_amsgrad.append(point_amsgrad[i][0][0])
    ydata_amsgrad.append(point_amsgrad[i][1][0])

    xdata_adamax.append(point_adamax[i][0][0])
    ydata_adamax.append(point_adamax[i][1][0])

    xlist = [xdata_sgd, xdata_sgd_momentum, xdata_sgd_nesterov,
             xdata_adagrad, xdata_rmsprop, xdata_adadelta,
             xdata_adam, xdata_amsgrad, xdata_adamax]

    ylist = [ydata_sgd, ydata_sgd_momentum, ydata_sgd_nesterov,
             ydata_adagrad, ydata_rmsprop, ydata_adadelta,
             ydata_adam, ydata_amsgrad, ydata_adamax]

    for lnum, line in enumerate(lines):
        if lnum in [0, 1, 2, 3, 4, 5, 6, 7, 8]:
            # 0 for SGD
            # 1 for SGD with Momentum
            # 2 for SGD with Nesterov acceleration
            # 3 for Adagrad
            # 4 for RMSProp
            # 5 for Adadelta
            # 6 for Adam
            # 7 for Amsgrad
            # 8 for Adamax
            line.set_data(xlist[lnum], ylist[lnum])

    ax.legend(['SGD, lr = 0.01', 'SGD_Momentum, lr = 0.01', 'SGD_Nesterov, lr = 0.01',
               'Adagrad, lr = 0.15', 'RMSProp, lr = 0.01', 'Adadelta, lr = 1 & ep = 10**-6',
               'Adam, lr = 0.01', 'Amsgrad, lr = 0.01', 'Adamax, lr = 0.01'], loc = "upper left")

    return lines
anim = animation.FuncAnimation(fig, animate, init_func = init,
                               frames = 1500, interval = 20, blit = True)

anim.save('2.10 Optimizers racing to the Minima.gif')
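PillowWriter is imported above, but the save call relies on matplotlib's defaults for the .gif extension. If you want to pin the GIF to 50 fps explicitly, one option (my assumption, not the original call) is:

anim.save('2.10 Optimizers racing to the Minima.gif', writer = PillowWriter(fps = 50))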
If you only want to compare 2 or 3 optimizers, or you want to see the trajectory of a single optimizer, then you only have to change the 'if lnum in [0, 1, 2, 3, 4, 5, 6, 7, 8]:' line inside the animate function.
If you only want to see Adadelta, change it to 'if lnum in [5]:',
or if you want SGD with Nesterov acceleration, RMSprop and Adamax, change it to 'if lnum in [2, 4, 8]:'.
Note — You should clear the previous plot before running the animation function again, otherwise the new animation will be drawn on top of the previous one.
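One simple way to do this in a notebook (my suggestion; any equivalent reset works) is to close the old figure and rebuild it before animating again:

plt.close(fig)    # discard the previous figure
# then re-run the figure setup cell (fig, ax, lines, init) before calling FuncAnimation again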
I know this animation program could have been smaller and more optimized, but we are here to learn, and the code should be readable and simple.
With this post, Chapter 2 — Optimizers is over. With the next post, we will start Chapter 3 — Activation functions and their derivatives. The most important post will be the last one, in which we will talk about the Softmax activation function and its Jacobian, which is super easy. I don't know why people don't talk about that.
Watch the video on YouTube and subscribe to the channel for videos and posts like this.
Every slide is 3 seconds long and without sound, so you may pause the video whenever you like, and put on some music if you want.
The video is basically everything in this post, only in slides.
Many thanks for your support and feedback.
If you like this course, then you can support me at
It would mean a lot to me.
Continue to the next post — 3.1 Sigmoid Activation function and its derivative.