Optimizers racing to the Minima
A step-by-step implementation, with an animation for better understanding
2.10 Optimizers racing to the Minima
This will be the final post in Chapter 2 — Optimizers.
You can download the Jupyter Notebook from here.
Note — It is recommended that you have a look at the previous posts in which we talked about SGD, SGD with Momentum, SGD with Nesterov acceleration, Adagrad, RMSprop, Adadelta, Adam, Amsgrad, and Adamax.
Here we will create an animation in which all 9 optimizers that we have seen in the previous posts race each other to reach the minima. But before going further, a few things first.
First, this post is neither a true nor a pseudo comparison of optimizers. I strongly recommend going through the literature before concluding which optimizer is the best. Many things were skipped in the previous posts, one of them being the mathematical motivation behind each optimizer.
Second, epsilon in these optimizers is used for stability, and by stability I mean this: suppose we start from a point that is already the minima. In that case, epsilon prevents a division-by-zero error.
But in the case of Adadelta, epsilon is also what kick-starts the optimizer. So we will use epsilon = 10**-6 for Adadelta and 10**-8 for all other optimizers.
And we will use a learning rate of 0.15 for Adagrad, 1 for Adadelta, and 0.01 for all other optimizers.
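To see why epsilon matters, here is a minimal standalone sketch (my addition, not part of the race code) of an Adagrad-style step taken exactly at the minima, where both the gradient and the accumulator are zero:

import numpy as np

grad = np.array([0.0, 0.0])            # gradient at the minima is zero
accumulator = grad**2                  # so the squared-gradient accumulator is zero too
epsilon = 10**-8

# without epsilon this would be 0 / 0 and produce nan values
update = - 0.01 * grad / (accumulator**0.5 + epsilon)
print(update)                          # [-0. -0.] instead of [nan nan]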
Third, this post gives you an idea of how to make such videos. You could also plot in 3-D to watch how we slide down to the minima, or put a colour contour in the background to get an idea of the function values (a minimal contour sketch is shown after the figure setup below).
All the steps are similar to what we did in the previous 9 videos. We will store the starting point and the updated points in a list, but this time we will use all values up to the iᵗʰ index for the iᵗʰ frame of our animation. We will start from the point (3, 4), and the animation will be 1500 frames long at 50 fps.
Note — You will notice that some optimizers do not exactly reach the minima, but that is okay for us; we will focus on their trajectories. You may animate up to 2000 frames or even more yourself.
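The frame logic is simply "draw everything up to the current index". As a rough sketch of that idea with one hypothetical trajectory (xs and ys are assumed lists of coordinates; the actual animate function below appends to running lists instead of slicing):

def draw_frame(i, line, xs, ys):
    # for the i-th frame, show all points up to and including index i
    line.set_data(xs[:i + 1], ys[:i + 1])
    return line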
And, now the animation for the race.
import numpy as np
np.random.seed(42)

import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib.animation import PillowWriter
We will use the same two-variable function which we used for the previous 9 posts,
f(x, y) = 2x² + 2xy + 2y² − 6x
We know that the minima for this function is at (2, -1), and we will start from (3, 4). The partial derivatives are
∂f/∂x = 4x + 2y − 6 and ∂f/∂y = 2x + 4y
def f(x, y):                                     # function definition
    return 2*(x**2) + 2*x*y + 2*(y**2) - 6*x

def fdash_x(x, y):                               # partial derivative w.r.t x
    return 4*x + 2*y - 6

def fdash_y(x, y):                               # partial derivative w.r.t y
    return 2*x + 4*y

def gradient(point):                             # gradient as a column vector
    return np.array([[ fdash_x(point[0][0], point[1][0]) ],
                     [ fdash_y(point[0][0], point[1][0]) ]], dtype = np.float64)
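As a quick sanity check (my addition, not part of the original notebook), the gradient should vanish at the minima (2, -1) and be clearly non-zero at our starting point (3, 4):

start = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)
minima = np.array([[ 2 ],
                   [ -1 ]], dtype = np.float64)

print(gradient(minima))    # [[0.] [0.]]  -> (2, -1) is a stationary point
print(gradient(start))     # [[14.] [22.]] -> plenty of slope at the start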
Now we will store the updated points in lists which initially contain only the starting point.
SGD
point_sgd = [np.array([[ 3 ],
                       [ 4 ]], dtype = np.float64)]      # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 0.01

for i in range(1500):
    update = - learning_rate * gradient(point)
    point += update
    point_sgd.append(point.copy())                       # adding updated points to the list

point
SGD with Momentum
point_sgd_momentum = [np.array([[ 3 ],
                                [ 4 ]], dtype = np.float64)]    # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 0.01
momentum = 0.9

update = np.array([[ 0 ],
                   [ 0 ]], dtype = np.float64)

for i in range(1500):
    update = - learning_rate * gradient(point) + momentum * update
    point += update
    point_sgd_momentum.append(point.copy())                     # adding updated points to the list

point
SGD with Nesterov acceleration
point_sgd_nesterov = [np.array([[ 3 ],
                                [ 4 ]], dtype = np.float64)]    # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 0.01
momentum = 0.9

update = np.array([[ 0 ],
                   [ 0 ]], dtype = np.float64)

for i in range(1500):
    update = - learning_rate * gradient(point) + momentum * update     # velocity
    update_ = - learning_rate * gradient(point) + momentum * update    # look-ahead step actually applied
    point += update_
    point_sgd_nesterov.append(point.copy())                            # adding updated points to the list

point
Adagrad
point_adagrad = [np.array([[ 3 ],
                           [ 4 ]], dtype = np.float64)]    # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 0.15

accumulator = np.array([[ 0.1 ],
                        [ 0.1 ]], dtype = np.float64)
epsilon = 10**-8

for i in range(1500):
    accumulator += gradient(point)**2
    update = - learning_rate * gradient(point) / (accumulator**0.5 + epsilon)
    point += update
    point_adagrad.append(point.copy())                     # adding updated points to the list

point
RMSprop
point_rmsprop = [np.array([[ 3 ],
                           [ 4 ]], dtype = np.float64)]    # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 0.01
rho = 0.9

accumulator = np.array([[ 0 ],
                        [ 0 ]], dtype = np.float32)
epsilon = 10**-8

for i in range(1500):
    accumulator = rho * accumulator + (1 - rho) * gradient(point)**2
    update = - learning_rate * gradient(point) / (accumulator**0.5 + epsilon)
    point += update
    point_rmsprop.append(point.copy())                     # adding updated points to the list

point
Adadelta
point_adadelta = [np.array([[ 3 ],
                            [ 4 ]], dtype = np.float64)]   # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 1
rho = 0.95
epsilon = 10**-6

accumulator_gradient = np.array([[ 0 ],
                                 [ 0 ]], dtype = np.float64)
accumulator_update = np.array([[ 0 ],
                               [ 0 ]], dtype = np.float64)

for i in range(1500):
    accumulator_gradient = rho * accumulator_gradient + (1 - rho) * gradient(point)**2
    update = - gradient(point) * (accumulator_update + epsilon)**0.5 / (accumulator_gradient + epsilon)**0.5
    accumulator_update = rho * accumulator_update + (1 - rho) * update**2
    point += learning_rate * update
    point_adadelta.append(point.copy())                    # adding updated points to the list

point
Adam
point_adam = [np.array([[ 3 ],
                        [ 4 ]], dtype = np.float64)]       # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 0.01
beta1 = 0.9
beta2 = 0.999
epsilon = 10**-8

moment1 = np.array([[ 0 ],
                    [ 0 ]], dtype = np.float64)
moment2 = np.array([[ 0 ],
                    [ 0 ]], dtype = np.float64)

for i in range(1500):
    learning_rate_hat = learning_rate * np.sqrt(1 - beta2**(i + 1)) / (1 - beta1**(i + 1))   # bias-corrected step size
    moment1 = beta1 * moment1 + (1 - beta1) * gradient(point)
    moment2 = beta2 * moment2 + (1 - beta2) * gradient(point)**2
    update = - learning_rate_hat * moment1 / (moment2**0.5 + epsilon)
    point += update
    point_adam.append(point.copy())                        # adding updated points to the list

point
Amsgrad
point_amsgrad = [np.array([[ 3 ],
                           [ 4 ]], dtype = np.float64)]    # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 0.01
beta1 = 0.9
beta2 = 0.999
epsilon = 10**-8

moment1 = np.array([[ 0 ],
                    [ 0 ]], dtype = np.float64)
moment2 = np.array([[ 0 ],
                    [ 0 ]], dtype = np.float64)
moment2_hat = np.array([[ 0 ],
                        [ 0 ]], dtype = np.float64)

for i in range(1500):
    learning_rate_hat = learning_rate * np.sqrt(1 - beta2**(i + 1)) / (1 - beta1**(i + 1))   # bias-corrected step size
    moment1 = beta1 * moment1 + (1 - beta1) * gradient(point)
    moment2 = beta2 * moment2 + (1 - beta2) * gradient(point)**2
    moment2_hat = np.maximum(moment2_hat, moment2)         # keep the running maximum of the second moment
    update = - learning_rate_hat * moment1 / (moment2_hat**0.5 + epsilon)
    point += update
    point_amsgrad.append(point.copy())                     # adding updated points to the list

point
Adamax
point_adamax = [np.array([[ 3 ],
                          [ 4 ]], dtype = np.float64)]     # list with the starting point

point = np.array([[ 3 ],
                  [ 4 ]], dtype = np.float64)

learning_rate = 0.01
beta1 = 0.9
beta2 = 0.999
epsilon = 10**-8

moment = np.array([[ 0 ],
                   [ 0 ]], dtype = np.float64)
weight = np.array([[ 0 ],
                   [ 0 ]], dtype = np.float64)

for i in range(1500):
    learning_rate_hat = learning_rate / (1 - beta1**(i + 1))            # bias-corrected step size
    moment = beta1 * moment + (1 - beta1) * gradient(point)
    weight = np.maximum(beta2 * weight, abs(gradient(point)))           # exponentially weighted infinity norm
    update = - learning_rate_hat * moment / (weight + epsilon)
    point += update
    point_adamax.append(point.copy())                      # adding updated points to the list

point
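Before animating, you can check how close each optimizer actually got to the minima (2, -1) after 1500 steps. This little report is my own addition for inspection, not part of the original notebook:

trajectories = {'SGD': point_sgd,
                'SGD with Momentum': point_sgd_momentum,
                'SGD with Nesterov': point_sgd_nesterov,
                'Adagrad': point_adagrad,
                'RMSprop': point_rmsprop,
                'Adadelta': point_adadelta,
                'Adam': point_adam,
                'Amsgrad': point_amsgrad,
                'Adamax': point_adamax}

minima = np.array([[ 2 ],
                   [ -1 ]], dtype = np.float64)

for name, points in trajectories.items():
    distance = np.linalg.norm(points[-1] - minima)     # distance of the final point from the minima
    print(f'{name:20s} distance from minima after 1500 steps = {distance:.6f}')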
Animation for the race
We will now set up the figure for the animation. You can change these settings if you want something different.
plt.rcParams.update({'font.size': 22})

fig = plt.figure(dpi = 100)
fig.set_figheight(10.80)
fig.set_figwidth(19.20)

ax = plt.axes()
ax.grid(alpha = 0.5)
ax.set_xlim(-5, 5)
ax.set_ylim(-5, 5)
ax.set_xlabel('x')
ax.set_ylabel('y', rotation = 0)
ax.set_title('Optimizers racing to the Minima')

ax.hlines(-1, -5, 5, linestyles = 'dashed', alpha = 0.5)    # dashed reference lines
ax.vlines(3, -5, 5, linestyles = 'dashed', alpha = 0.5)

lines = []                                                  # one line object per optimizer
for index in range(9):
    lobj = ax.plot([], [], lw = 2)[0]
    lines.append(lobj)

ax.legend(['SGD, lr = 0.01', 'SGD_Momentum, lr = 0.01', 'SGD_Nesterov, lr = 0.01',
           'Adagrad, lr = 0.15', 'RMSProp, lr = 0.01', 'Adadelta, lr = 1 & ep = 10**-6',
           'Adam, lr = 0.01', 'Amsgrad, lr = 0.01', 'Adamax, lr = 0.01'], loc = "upper left")

def init():
    for line in lines:
        line.set_data([], [])
    return lines
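Earlier I mentioned that you can put a colour contour in the background to get an idea of the function values. Here is a minimal sketch of that idea (my addition, assuming the fig, ax and f defined above); it simply shades the plotting window before the lines are animated:

# optional colour contour of f over the plotting window
x_grid, y_grid = np.meshgrid(np.linspace(-5, 5, 200), np.linspace(-5, 5, 200))
contour = ax.contourf(x_grid, y_grid, f(x_grid, y_grid), levels = 30, cmap = 'viridis', alpha = 0.4)
fig.colorbar(contour, ax = ax)         # scale bar for the function values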
xdata_sgd, ydata_sgd = [], []
xdata_sgd_momentum, ydata_sgd_momentum = [], []
xdata_sgd_nesterov, ydata_sgd_nesterov = [], []
xdata_adagrad, ydata_adagrad = [], []
xdata_rmsprop, ydata_rmsprop = [], []
xdata_adadelta, ydata_adadelta = [], []
xdata_adam, ydata_adam = [], []
xdata_amsgrad, ydata_amsgrad = [], []
xdata_adamax, ydata_adamax = [], []
Now we will animate the race.
def animate(i):
    xdata_sgd.append(point_sgd[i][0][0])
    ydata_sgd.append(point_sgd[i][1][0])

    xdata_sgd_momentum.append(point_sgd_momentum[i][0][0])
    ydata_sgd_momentum.append(point_sgd_momentum[i][1][0])

    xdata_sgd_nesterov.append(point_sgd_nesterov[i][0][0])
    ydata_sgd_nesterov.append(point_sgd_nesterov[i][1][0])

    xdata_adagrad.append(point_adagrad[i][0][0])
    ydata_adagrad.append(point_adagrad[i][1][0])

    xdata_rmsprop.append(point_rmsprop[i][0][0])
    ydata_rmsprop.append(point_rmsprop[i][1][0])

    xdata_adadelta.append(point_adadelta[i][0][0])
    ydata_adadelta.append(point_adadelta[i][1][0])

    xdata_adam.append(point_adam[i][0][0])
    ydata_adam.append(point_adam[i][1][0])

    xdata_amsgrad.append(point_amsgrad[i][0][0])
    ydata_amsgrad.append(point_amsgrad[i][1][0])

    xdata_adamax.append(point_adamax[i][0][0])
    ydata_adamax.append(point_adamax[i][1][0])

    xlist = [xdata_sgd, xdata_sgd_momentum, xdata_sgd_nesterov,
             xdata_adagrad, xdata_rmsprop, xdata_adadelta,
             xdata_adam, xdata_amsgrad, xdata_adamax]

    ylist = [ydata_sgd, ydata_sgd_momentum, ydata_sgd_nesterov,
             ydata_adagrad, ydata_rmsprop, ydata_adadelta,
             ydata_adam, ydata_amsgrad, ydata_adamax]

    for lnum, line in enumerate(lines):
        if lnum in [0, 1, 2, 3, 4, 5, 6, 7, 8]:
            # 0 for SGD
            # 1 for SGD with Momentum
            # 2 for SGD with Nesterov acceleration
            # 3 for Adagrad
            # 4 for RMSProp
            # 5 for Adadelta
            # 6 for Adam
            # 7 for Amsgrad
            # 8 for Adamax
            line.set_data(xlist[lnum], ylist[lnum])

    ax.legend(['SGD, lr = 0.01', 'SGD_Momentum, lr = 0.01', 'SGD_Nesterov, lr = 0.01',
               'Adagrad, lr = 0.15', 'RMSProp, lr = 0.01', 'Adadelta, lr = 1 & ep = 10**-6',
               'Adam, lr = 0.01', 'Amsgrad, lr = 0.01', 'Adamax, lr = 0.01'], loc = "upper left")

    return lines
anim = animation.FuncAnimation(fig, animate, init_func = init,
                               frames = 1500, interval = 20, blit = True)

anim.save('2.10 Optimizers racing to the Minima.gif')
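PillowWriter is imported above, but the save call relies on matplotlib's defaults for the .gif extension. If you want to pin the GIF to 50 fps explicitly, one option (my assumption, not the original call) is:

anim.save('2.10 Optimizers racing to the Minima.gif', writer = PillowWriter(fps = 50))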
If you only want to compare 2 or 3 optimizers, or you want to see the trajectory of a single optimizer, then you only have to change the 'if lnum in [0, 1, 2, 3, 4, 5, 6, 7, 8]:' line inside the animate function.
If you only want to see Adadelta, change it to 'if lnum in [5]:',
or if you want SGD with Nesterov acceleration, RMSprop and Adamax, change it to 'if lnum in [2, 4, 8]:'.
Note — You should clear the previous plot before running the animation function again, otherwise the new animation will be drawn on top of the previous one.
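One simple way to do this in a notebook (my suggestion; any equivalent reset works) is to close the old figure and rebuild it before animating again:

plt.close(fig)    # discard the previous figure
# then re-run the figure setup cell (fig, ax, lines, init) before calling FuncAnimation again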
I know this animation program could have been smaller and more optimized, but we are here to learn, and the code should be readable and simple.
With this post, Chapter 2 — Optimizers is over. With the next post, we will start Chapter 3 — Activation functions and their derivatives. The most important post will be the last one, in which we will talk about the Softmax activation function and its Jacobian, which is super easy. I don't know why people don't talk about that.
Watch the video on YouTube and subscribe to the channel for videos and posts like this.
Every slide is 3 seconds long and without sound, so you may pause the video whenever you like, and put on some music if you want.
The video is basically everything in this post, only in slides.
Many thanks for your support and feedback.
If you like this course, then you can support me at
It would mean a lot to me.
Continue to the next post — 3.1 Sigmoid Activation function and its derivative.