# Amsgrad: A variant of Adam using the maximum of past squared gradients

## Step by step implementation with animation for better understanding

# 2.8 What is Amsgrad?

Amsgrad is an extension of Adam with one extra step. Instead of using the second moment (the moving average of squared gradients) directly, Amsgrad uses the maximum of its past and current values, which we will call second-moment hat. All other steps are the same as in Adam.
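To see what the extra step changes, here is a small sketch that runs only the second-moment update on a made-up gradient sequence (a few large gradients followed by many small ones). The gradient values and the history lists are assumptions chosen for illustration; they do not come from the function used later in this post.

```python
import numpy as np

beta2 = 0.999

# Hypothetical gradient sequence: 5 large gradients, then 2000 small ones
gradients = np.concatenate([np.full(5, 2.0), np.full(2000, 0.01)])

moment2 = 0.0
moment2_hat = 0.0
moment2_history = []
moment2_hat_history = []

for g in gradients:
    moment2 = beta2 * moment2 + (1 - beta2) * g**2   # Adam's second moment
    moment2_hat = max(moment2_hat, moment2)          # Amsgrad's extra step
    moment2_history.append(moment2)
    moment2_hat_history.append(moment2_hat)

# moment2 decays once the gradients become small ...
print(moment2_history[4], moment2_history[-1])
# ... while moment2_hat keeps the largest value seen so far
print(moment2_hat_history[4], moment2_hat_history[-1])
```

Because moment2_hat never decreases, the denominator of the update cannot shrink after a run of small gradients, which is the only behavioural difference from Adam in this sketch.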

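We calculate ‘learning rate hat’ as follows:

$$\text{learning\_rate\_hat} = \text{learning\_rate} \cdot \frac{\sqrt{1 - \beta_2^{\,i+1}}}{1 - \beta_1^{\,i+1}}$$

Where beta 1 ($\beta_1$) and beta 2 ($\beta_2$) are the decay rates for the first and second moment respectively, and ‘i’ is the iteration loop index starting from 0.

And we calculate the update as follows:

$$\text{update} = -\,\text{learning\_rate\_hat} \cdot \frac{\text{moment1}}{\sqrt{\text{moment2\_hat}} + \epsilon}$$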
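As a quick numeric illustration, the sketch below evaluates learning rate hat at a few iteration indices, using the expression from the code later in this post: learning_rate * sqrt(1 - beta2**(i + 1)) / (1 - beta1**(i + 1)). The helper function name and the chosen indices are assumptions made for this example.

```python
import numpy as np

learning_rate = 0.01
beta1 = 0.9
beta2 = 0.999

def learning_rate_hat(i):
    """Learning rate hat for iteration index i (hypothetical helper, i starts at 0)."""
    return learning_rate * np.sqrt(1 - beta2**(i + 1)) / (1 - beta1**(i + 1))

# Print the value at a few iteration indices
for i in [0, 9, 99, 999, 9999]:
    print(i, learning_rate_hat(i))
```

At i = 0 the value is 0.01 * sqrt(0.001) / 0.1, roughly 0.0032, and for large i both correction terms tend to 1, so learning rate hat approaches the plain learning rate.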

You can download the Jupyter Notebook from here.

Note: It is recommended that you have a look at the first post in this chapter.

This post is divided into 3 sections.

- Amsgrad in 1 variable
- Amsgrad animation for 1 variable
- Amsgrad in multi-variable function

# Amsgrad in 1 variable

Everything is the same as what we did in Adam. We will add one more step for calculating second-moment hat and initialize moment2_hat = 0.

Amsgrad algorithm in simple language is as follows:

- Step 1: Set the starting point and learning rate
- Step 2: Initialize `beta1 = 0.9`, `beta2 = 0.999`, `moment1 = 0`, `moment2 = 0`, `moment2_hat = 0` and `epsilon = 10**-8`
- Step 3: Start the loop
  - Step 3.1: Calculate learning rate hat as stated above
  - Step 3.2: Calculate `moment1 = beta1 * moment1 + (1 - beta1) * gradient`
  - Step 3.3: Calculate `moment2 = beta2 * moment2 + (1 - beta2) * gradient**2`
  - Step 3.4: Calculate `moment2_hat = maximum(moment2_hat, moment2)`
  - Step 3.5: Calculate the update as stated above
  - Step 3.6: Add the update to the point

First, let us define the function and its derivative. We will start from x = -1.

```python
import numpy as np

np.random.seed(42)

def f(x):                                         # function definition
    return x - x**3

def fdash(x):                                     # function derivative definition
    return 1 - 3*(x**2)
```

And now Amsgrad

```python
point = -1                                        # step 1
learning_rate = 0.01

beta1 = 0.9                                       # step 2
beta2 = 0.999
epsilon = 10**-8
moment1 = 0
moment2 = 0
moment2_hat = 0

for i in range(1000):                             # step 3
    # step 3.1 (the parentheses let the expression span two lines)
    learning_rate_hat = (learning_rate * np.sqrt(1 - beta2**(i + 1))
                         / (1 - beta1**(i + 1)))
    moment1 = beta1 * moment1 + (1 - beta1) * fdash(point)       # step 3.2
    moment2 = beta2 * moment2 + (1 - beta2) * fdash(point)**2    # step 3.3
    moment2_hat = np.maximum(moment2_hat, moment2)               # step 3.4
    update = - learning_rate_hat * moment1 / (moment2_hat**0.5 + epsilon)  # step 3.5
    point += update                                              # step 3.6

point                                             # minimum
```

Note: np.maximum is different from np.max. np.maximum compares two inputs and returns their element-wise maximum, whereas np.max returns the largest element of a single array.
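A minimal sketch of the difference, using two small arrays made up for this example:

```python
import numpy as np

a = np.array([1.0, 5.0, 3.0])
b = np.array([4.0, 2.0, 6.0])

elementwise = np.maximum(a, b)   # compares a and b position by position
largest = np.max(a)              # reduces a single array to its largest value

print(elementwise)
print(largest)
```

In the Amsgrad loop the two inputs are moment2_hat and moment2, so np.maximum keeps the larger value for each variable separately, which matters once there is more than one variable.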

And, we have successfully implemented Amsgrad in Python.

# Amsgrad animation for better understanding

Everything is the same as what we did earlier for the animations of the previous 7 optimizers. We will create a list that stores the starting point and every updated point, and use the iᵗʰ value in the list for the iᵗʰ frame of the animation.

```python
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib.animation import PillowWriter

point_amsgrad = [-1]                              # initiating list with
                                                  # starting point in it

point = -1                                        # step 1
learning_rate = 0.01

beta1 = 0.9                                       # step 2
beta2 = 0.999
epsilon = 10**-8
moment1 = 0
moment2 = 0
moment2_hat = 0

for i in range(1000):                             # step 3
    # step 3.1 (the parentheses let the expression span two lines)
    learning_rate_hat = (learning_rate * np.sqrt(1 - beta2**(i + 1))
                         / (1 - beta1**(i + 1)))
    moment1 = beta1 * moment1 + (1 - beta1) * fdash(point)       # step 3.2
    moment2 = beta2 * moment2 + (1 - beta2) * fdash(point)**2    # step 3.3
    moment2_hat = np.maximum(moment2_hat, moment2)               # step 3.4
    update = - learning_rate_hat * moment1 / (moment2_hat**0.5 + epsilon)  # step 3.5
    point += update                                              # step 3.6
    point_amsgrad.append(point)                   # adding updated point to
                                                  # the list

point                                             # minimum
```

We will do some settings for our graph for the animation. You can change them if you want something different.

```python
plt.rcParams.update({'font.size': 22})

fig = plt.figure(dpi = 100)
fig.set_figheight(10.80)
fig.set_figwidth(19.20)

x_ = np.linspace(-5, 5, 10000)
y_ = f(x_)

ax = plt.axes()
ax.plot(x_, y_)
ax.grid(alpha = 0.5)
ax.set_xlim(-5, 5)
ax.set_ylim(-5, 5)
ax.set_xlabel('x')
ax.set_ylabel('y', rotation = 0)
ax.scatter(-1, f(-1), color = 'red')
ax.hlines(f(-0.5773502691896256), -5, 5, linestyles = 'dashed', alpha = 0.5)
ax.set_title('Amsgrad, learning_rate = 0.01')
```
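The dashed horizontal line in the settings is drawn at f(-0.5773502691896256). That constant is the local minimum of f(x) = x - x**3, at x = -1/√3. The short check below confirms it; the second-derivative helper fdash2 is an assumption added for this example and is not part of the original code.

```python
import numpy as np

def fdash(x):            # first derivative of f(x) = x - x**3
    return 1 - 3 * x**2

def fdash2(x):           # second derivative (hypothetical helper)
    return -6 * x

x_min = -1 / np.sqrt(3)  # negative root of 1 - 3*x**2 = 0

print(x_min)
print(fdash(x_min))      # approximately 0, so x_min is a stationary point
print(fdash2(x_min))     # positive, so the stationary point is a minimum
```

It is only a local minimum, since f keeps decreasing for large positive x, but it is the one the optimizer heads towards when starting from x = -1.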

Now we will animate the Amsgrad optimizer.

```python
def animate(i):
    ax.clear()
    ax.plot(x_, y_)
    ax.grid(alpha = 0.5)
    ax.set_xlim(-5, 5)
    ax.set_ylim(-5, 5)
    ax.set_xlabel('x')
    ax.set_ylabel('y', rotation = 0)
    ax.hlines(f(-0.5773502691896256), -5, 5, linestyles = 'dashed', alpha = 0.5)
    ax.set_title('Amsgrad, learning_rate = 0.01')
    ax.scatter(point_amsgrad[i], f(point_amsgrad[i]), color = 'red')
```

The last line in the code snippet above is using the iᵗʰ index value from the list for iᵗʰ frame in the animation.

```python
anim = animation.FuncAnimation(fig, animate, frames = 200, interval = 20)
anim.save('2.8 Amsgrad.gif')
```

The animation has only 200 frames. With a frame interval of 20 ms, the GIF plays at 50 fps, so it lasts 4 seconds.

Note that we reach the minimum in fewer than 200 iterations.

# Amsgrad in multi-variable function (2 variables right now)

Everything is the same. We only have to initialize point as (1, 0), initialize moment1, moment2 and moment2_hat as zero arrays of shape (2, 1), and replace fdash(point) with gradient(point).

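But first, let us define the function, its partial derivatives and the gradient array.

$$f(x, y) = 2x^2 + 2xy + 2y^2 - 6x$$

We know that the minimum of this function is at (2, -1), and we will start from (1, 0).

The partial derivatives are

$$\frac{\partial f}{\partial x} = 4x + 2y - 6, \qquad \frac{\partial f}{\partial y} = 2x + 4y$$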

```python
def f(x, y):                                      # function
    return 2*(x**2) + 2*x*y + 2*(y**2) - 6*x      # definition

def fdash_x(x, y):                                # partial derivative
    return 4*x + 2*y - 6                          # w.r.t x

def fdash_y(x, y):                                # partial derivative
    return 2*x + 4*y                              # w.r.t y

def gradient(point):                              # gradients
    return np.array([[ fdash_x(point[0][0], point[1][0]) ],
                     [ fdash_y(point[0][0], point[1][0]) ]], dtype = np.float64)
```
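The minimum at (2, -1) can be checked by setting both partial derivatives to zero, which gives a small linear system. The sketch below solves it with NumPy; the matrix A and vector b are simply the coefficients read off fdash_x and fdash_y, and the variable names are assumptions for this example.

```python
import numpy as np

# Setting both partial derivatives to zero gives a linear system:
#   4x + 2y = 6
#   2x + 4y = 0
A = np.array([[4.0, 2.0],
              [2.0, 4.0]])
b = np.array([6.0, 0.0])

stationary_point = np.linalg.solve(A, b)   # point where the gradient is zero
eigenvalues = np.linalg.eigvalsh(A)        # A is also the Hessian of f

print(stationary_point)
print(eigenvalues)
```

Both eigenvalues of the Hessian are positive, so the stationary point is a minimum rather than a saddle point or a maximum.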

Now the steps for Amsgrad in 2 variables are

```python
point = np.array([[ 1 ],                          # step 1
                  [ 0 ]], dtype = np.float64)
learning_rate = 0.01

beta1 = 0.9                                       # step 2
beta2 = 0.999
epsilon = 10**-8
moment1 = np.array([[ 0 ],
                    [ 0 ]], dtype = np.float64)
moment2 = np.array([[ 0 ],
                    [ 0 ]], dtype = np.float64)
moment2_hat = np.array([[ 0 ],
                        [ 0 ]], dtype = np.float64)

for i in range(1000):                             # step 3
    # step 3.1 (the parentheses let the expression span two lines)
    learning_rate_hat = (learning_rate * np.sqrt(1 - beta2**(i + 1))
                         / (1 - beta1**(i + 1)))
    moment1 = beta1 * moment1 + (1 - beta1) * gradient(point)     # step 3.2
    moment2 = beta2 * moment2 + (1 - beta2) * gradient(point)**2  # step 3.3
    moment2_hat = np.maximum(moment2_hat, moment2)                # step 3.4
    update = - learning_rate_hat * moment1 / (moment2_hat**0.5 + epsilon)  # step 3.5
    point += update                                               # step 3.6

point                                             # minimum
```
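Since the 1-variable and 2-variable loops differ only in the shapes of the arrays, the same steps can be wrapped in one function. The following is a sketch under a few assumptions: the function name amsgrad, its parameters, and the flat gradient array of shape (2,) are choices made for this example and are not part of the original notebook.

```python
import numpy as np

def amsgrad(grad_fn, start, learning_rate=0.01, beta1=0.9, beta2=0.999,
            epsilon=1e-8, iterations=1000):
    """Run the Amsgrad steps from this post on any gradient function (sketch)."""
    point = np.array(start, dtype=np.float64)
    moment1 = np.zeros_like(point)
    moment2 = np.zeros_like(point)
    moment2_hat = np.zeros_like(point)

    for i in range(iterations):
        g = grad_fn(point)                       # gradient computed once per step
        learning_rate_hat = (learning_rate * np.sqrt(1 - beta2**(i + 1))
                             / (1 - beta1**(i + 1)))
        moment1 = beta1 * moment1 + (1 - beta1) * g
        moment2 = beta2 * moment2 + (1 - beta2) * g**2
        moment2_hat = np.maximum(moment2_hat, moment2)
        update = -learning_rate_hat * moment1 / (np.sqrt(moment2_hat) + epsilon)
        point = point + update
    return point

def gradient_flat(point):
    """Gradient of 2x^2 + 2xy + 2y^2 - 6x as a flat array (assumed layout)."""
    x, y = point
    return np.array([4*x + 2*y - 6, 2*x + 4*y])

result = amsgrad(gradient_flat, [1.0, 0.0])
print(result)   # should be close to the minimum at (2, -1)
```

Passing the gradient as a function keeps the loop independent of the number of variables, so the same code can be reused for other functions by swapping grad_fn and start.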

I hope now you understand how Amsgrad works.

Watch the video on youtube and subscribe to the channel for videos and posts like this.

Every slide is 3 seconds long and without sound. You may pause the video whenever you like.

You may put on some music too if you like.

The video is basically everything in the post only in slides.
