SGD with Nesterov acceleration — How it reduces the oscillation in SGD with Momentum

Step by step implementation with animation for better understanding.

neuralthreads
Nov 26, 2021

Back to the previous post

Back to the first post

2.3 What is SGD with Nesterov acceleration?

We saw in the previous post how SGD with Momentum works, and it was clear from its definition and from the animation that the optimizer oscillates. Sometimes we want to reduce that oscillation and get faster convergence. For that, we use a trick called Nesterov acceleration: we compute the update with a correction, by calculating the update twice, where the second update is the correction derived from the first.

You can download the Jupyter Notebook from here.

Note — It is recommended that you have a look at the previous post.

This post is divided into 3 sections.

  1. SGD with Nesterov acceleration in 1 variable
  2. SGD with Nesterov acceleration animation for 1 variable
  3. SGD with Nesterov acceleration in multi-variable function

SGD with Nesterov acceleration in 1 variable

In this method, everything is the same as in SGD with Momentum, except that we calculate the update twice before adding it to the point.

SGD with Nesterov acceleration algorithm in simple language is as follows:

Step 1 - Set starting point and learning rate
Step 2 - Initialize update = 0 and momentum = 0.9
Step 3 - Initiate loop
    Step 3.1 - calculate update = -learning_rate * gradient + momentum * update
    Step 3.2 - calculate update_ = -learning_rate * gradient + momentum * update
    Step 3.3 - add update_ to point

Note — In this method the second update, i.e., ‘update_’, can be treated as the update with correction.
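Written out as equations (just a restatement of the steps above, with η as the learning rate and β as the momentum coefficient, 0.9 here):

updateₜ  = −η · f′(xₜ) + β · updateₜ₋₁          (step 3.1)
update_ₜ = −η · f′(xₜ) + β · updateₜ            (step 3.2)
xₜ₊₁     = xₜ + update_ₜ                        (step 3.3)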

First, let us define the function and its derivative. The function is f(x) = x − x³, and we will start from x = -1.

import numpy as np
np.random.seed(42)

def f(x):                                       # function definition
    return x - x**3

def fdash(x):                                   # function derivative definition
    return 1 - 3*(x**2)

Defining the function and its derivative

And now SGD with Nesterov acceleration

point = -1                                      # step 1
learning_rate = 0.01
momentum = 0.9                                  # step 2
update = 0

for i in range(1000):                           # step 3
    update = - learning_rate * fdash(point) + momentum * update    # step 3.1
    update_ = - learning_rate * fdash(point) + momentum * update   # step 3.2
    point += update_                            # step 3.3

point                                           # Minima

Steps involved in SGD with Nesterov acceleration

We have successfully implemented SGD with Nesterov acceleration in Python.
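As a quick check (a small sketch, not part of the original notebook), the point we end up with should be very close to the analytical minimum of f(x) = x − x³, which lies at x = −1/√3 ≈ −0.5774:

print(point)                                    # ≈ -0.5773502691896256
np.isclose(point, -1/np.sqrt(3))                # should be True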

SGD with Nesterov acceleration animation for better understanding

Everything is the same as what we did earlier for the animations of SGD and SGD with Momentum. We will create a list that stores the starting point and the updated points, and we will use the iᵗʰ index value for the iᵗʰ frame of the animation.

import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib.animation import PillowWriter

point_sgd_nesterov = [-1]                       # initiating list with starting point in it

point = -1                                      # step 1
learning_rate = 0.01
momentum = 0.9                                  # step 2
update = 0

for i in range(1000):                           # step 3
    update = - learning_rate * fdash(point) + momentum * update    # step 3.1
    update_ = - learning_rate * fdash(point) + momentum * update   # step 3.2
    point += update_                            # step 3.3

    point_sgd_nesterov.append(point)            # adding updated point to the list

point                                           # Minima

Importing libraries and creating a list that has the starting point and updated points in it

We will do some settings for our graph for the animation. You can change them if you want something different.

plt.rcParams.update({'font.size': 22})

fig = plt.figure(dpi = 100)
fig.set_figheight(10.80)
fig.set_figwidth(19.20)
x_ = np.linspace(-5, 5, 10000)
y_ = f(x_)
ax = plt.axes()
ax.plot(x_, y_)
ax.grid(alpha = 0.5)
ax.set_xlim(-5, 5)
ax.set_ylim(-5, 5)
ax.set_xlabel('x')
ax.set_ylabel('y', rotation = 0)
ax.scatter(-1, f(-1), color = 'red')
ax.hlines(f(-0.5773502691896256), -5, 5, linestyles = 'dashed', alpha = 0.5)
ax.set_title('SGD with Nesterov acceleration, learning_rate = 0.01')
A few settings for our graph in the animation
The very first frame of the animation

Now we will animate the SGD with Nesterov acceleration optimizer.

def animate(i):
    ax.clear()
    ax.plot(x_, y_)
    ax.grid(alpha = 0.5)
    ax.set_xlim(-5, 5)
    ax.set_ylim(-5, 5)
    ax.set_xlabel('x')
    ax.set_ylabel('y', rotation = 0)
    ax.hlines(f(-0.5773502691896256), -5, 5, linestyles = 'dashed', alpha = 0.5)
    ax.set_title('SGD with Nesterov acceleration, learning_rate = 0.01')

    ax.scatter(point_sgd_nesterov[i], f(point_sgd_nesterov[i]), color = 'red')

The last line in the code snippet above uses the iᵗʰ index value from the list for the iᵗʰ frame of the animation.

anim = animation.FuncAnimation(fig, animate, frames = 200, interval = 20)
anim.save('2.3 SGD with Nesterov acceleration.gif')

We are creating an animation that has only 200 frames, and the gif runs at 50 fps, i.e., a frame interval of 20 ms.
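Note that PillowWriter is imported above but not passed to the save call; if you want to set the gif frame rate explicitly, you could do so like this (an optional tweak, not from the original notebook):

anim.save('2.3 SGD with Nesterov acceleration.gif',
          writer = PillowWriter(fps = 50))      # 50 fps = 20 ms per frame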

Note that we reach the minimum in fewer than 200 iterations.
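If you want to verify that numerically, one way (a rough sketch using the point_sgd_nesterov list built above; with momentum the point may overshoot and come back, so this only reports the first pass near the minimum) is:

minimum = -1/np.sqrt(3)
first_hit = next((i for i, p in enumerate(point_sgd_nesterov)
                  if abs(p - minimum) < 1e-3), None)
print(first_hit)                                # should be well below 200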

SGD with Nesterov acceleration Animation

SGD with Nesterov acceleration in multi-variable function (2 variables right now)

Everything is the same; we only have to initialize the point at (1, 0) and update = 0, both as arrays of shape (2, 1), and replace fdash(point) with gradient(point).

But first, let us define the function, its partial derivatives, and the gradient array.

The function is f(x, y) = 2x² + 2xy + 2y² − 6x. We know that the minimum of this function is at (2, -1), and we will start from (1, 0).

The partial derivatives are ∂f/∂x = 4x + 2y − 6 and ∂f/∂y = 2x + 4y.

def f(x, y):                                    # function definition
    return 2*(x**2) + 2*x*y + 2*(y**2) - 6*x

def fdash_x(x, y):                              # partial derivative w.r.t x
    return 4*x + 2*y - 6

def fdash_y(x, y):                              # partial derivative w.r.t y
    return 2*x + 4*y

def gradient(point):                            # gradient array
    return np.array([[ fdash_x(point[0][0], point[1][0]) ],
                     [ fdash_y(point[0][0], point[1][0]) ]], dtype = np.float64)

Defining the function, its partial derivatives, and gradient array
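As a quick sanity check (not in the original notebook), the gradient should vanish at the known minimum (2, -1):

gradient(np.array([[  2 ],
                   [ -1 ]], dtype = np.float64))    # array([[0.], [0.]])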

Now the steps for SGD with Nesterov acceleration in 2 variables are

point = np.array([[ 1 ],                        # step 1
                  [ 0 ]], dtype = np.float64)
learning_rate = 0.01
momentum = 0.9                                  # step 2
update = np.array([[ 0 ],
                   [ 0 ]], dtype = np.float64)

for i in range(1000):                           # step 3
    update = - learning_rate * gradient(point) + momentum * update    # step 3.1
    update_ = - learning_rate * gradient(point) + momentum * update   # step 3.2
    point += update_                            # step 3.3

point                                           # Minima

Steps involved in SGD with Nesterov acceleration for 2 variable function
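And again a small check (a sketch, assuming the loop above has converged) that the result matches the known minimum (2, -1):

print(point)                                    # ≈ [[ 2.], [-1.]]
np.allclose(point, np.array([[  2 ],
                             [ -1 ]], dtype = np.float64))   # should be True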

I hope you now understand SGD with Nesterov acceleration.
If you want to study more about Nesterov acceleration, then you may look at the literature available on the internet.

Watch the video on YouTube and subscribe to the channel for videos and posts like this.
Every slide is 3 seconds long and without sound. You may pause the video whenever you like.
You may put on some music too if you like.

The video is basically everything in the post only in slides.

Many thanks for your support and feedback.

If you like this course, then you can support me at

It would mean a lot to me.

Continue to the next post — 2.4 Adagrad.
