Adamax — Extension to Adam based on infinity norm
Step by step implementation with animation for better understanding
2.9 What is Adamax?
In this method, we do not keep a second moment directly. Instead, we keep a 'weight' that is the maximum of the decayed previous weight and the current absolute value of the gradient.
Like Adam, we will include bias correction, here applied to the learning rate, calling the corrected value 'learning rate hat'.
Here, beta1 is the decay rate for the first moment and 'i' is the loop iteration index, starting from 0.
And we calculate the update as follows:
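Written out to match the implementation steps below, at iteration i the quantities are:

learning_rate_hat = learning_rate / (1 - beta1**(i + 1))
moment = beta1 * moment + (1 - beta1) * gradient
weight = maximum(beta2 * weight, absolute(gradient))
update = - learning_rate_hat * moment / (weight + epsilon)

and the update is then added to the current point.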
You can download the Jupyter Notebook from here.
Note — It is recommended that you have a look at the first post in this chapter.
This post is divided into 3 sections.
- Adamax in 1 variable
- Adamax animation for 1 variable
- Adamax in multi-variable function
Adamax in 1 variable
In this method, we calculate the weight as the maximum of beta2 times the past weight and the absolute value of the gradient.
The Adamax algorithm in simple language is as follows:
Step 1 - Set starting point and learning rate
Step 2 - Initiate beta1 = 0.9
beta2 = 0.999
moment = 0
weight = 0
epsilon = 10**-8
Step 3 - Initiate loop
Step 3.1 - calculate learning rate hat as stated above
Step 3.2 - calculate moment = beta1 * moment + (1 - beta1) * gradient
Step 3.3 - calculate weight = maximum(beta2 * weight, absolute(gradient))
Step 3.4 - calculate update as stated above
Step 3.5 - add update to point
First, let us define the function, f(x) = x - x³, and its derivative; we will start from x = -1.
import numpy as np
np.random.seed(42)

def f(x):                 # function definition
    return x - x**3

def fdash(x):             # function derivative definition
    return 1 - 3*(x**2)
And now, Adamax:
point = -1                                                      # step 1
learning_rate = 0.01

beta1 = 0.9                                                     # step 2
beta2 = 0.999
epsilon = 10**-8
moment = 0
weight = 0

for i in range(1000):                                           # step 3
    learning_rate_hat = learning_rate / (1 - beta1**(i + 1))    # step 3.1
    moment = beta1 * moment + (1 - beta1) * fdash(point)        # step 3.2
    weight = np.maximum(beta2 * weight, abs(fdash(point)))      # step 3.3
    update = - learning_rate_hat * moment / (weight + epsilon)  # step 3.4
    point += update                                             # step 3.5

point   # Minima
And, we have successfully implemented Adamax in Python.
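As a quick, optional sanity check, the local minimum of f(x) = x - x³ lies at x = -1/√3 ≈ -0.5774, so the final point should be close to that value:

point                             # expected to be approximately -0.5774
abs(point - (-1/np.sqrt(3)))      # expected to be a very small number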
Adamax animation for better understanding
Everything is the same as what we did earlier for the animations of the previous 8 optimizers. We will create a list that stores the starting point and the updated points, and we will use the iᵗʰ index value for the iᵗʰ frame of the animation.
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib.animation import PillowWriter

point_adamax = [-1]                 # initiating list with starting point in it

point = -1                          # step 1
learning_rate = 0.01

beta1 = 0.9                         # step 2
beta2 = 0.999
epsilon = 10**-8
moment = 0
weight = 0

for i in range(1000):                                           # step 3
    learning_rate_hat = learning_rate / (1 - beta1**(i + 1))    # step 3.1
    moment = beta1 * moment + (1 - beta1) * fdash(point)        # step 3.2
    weight = np.maximum(beta2 * weight, abs(fdash(point)))      # step 3.3
    update = - learning_rate_hat * moment / (weight + epsilon)  # step 3.4
    point += update                                             # step 3.5
    point_adamax.append(point)      # adding updated point to the list

point   # Minima
We will apply some settings to our graph for the animation. You can change them if you want something different.
plt.rcParams.update({'font.size': 22})

fig = plt.figure(dpi = 100)
fig.set_figheight(10.80)
fig.set_figwidth(19.20)

x_ = np.linspace(-5, 5, 10000)
y_ = f(x_)

ax = plt.axes()
ax.plot(x_, y_)
ax.grid(alpha = 0.5)
ax.set_xlim(-5, 5)
ax.set_ylim(-5, 5)
ax.set_xlabel('x')
ax.set_ylabel('y', rotation = 0)
ax.scatter(-1, f(-1), color = 'red')
ax.hlines(f(-0.5773502691896256), -5, 5, linestyles = 'dashed', alpha = 0.5)
ax.set_title('Adamax, learning_rate = 0.01')
Now we will animate the Adamax optimizer.
def animate(i):
    ax.clear()
    ax.plot(x_, y_)
    ax.grid(alpha = 0.5)
    ax.set_xlim(-5, 5)
    ax.set_ylim(-5, 5)
    ax.set_xlabel('x')
    ax.set_ylabel('y', rotation = 0)
    ax.hlines(f(-0.5773502691896256), -5, 5, linestyles = 'dashed', alpha = 0.5)
    ax.set_title('Adamax, learning_rate = 0.01')
    ax.scatter(point_adamax[i], f(point_adamax[i]), color = 'red')
The last line in the code snippet above is using the iᵗʰ index value from the list for iᵗʰ frame in the animation.
anim = animation.FuncAnimation(fig, animate, frames = 200, interval = 20)
anim.save('2.9 Adamax.gif')
We are creating an animation with only 200 frames, and the gif runs at 50 fps (a frame interval of 20 ms).
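If you would rather set the frame rate explicitly when saving (an optional variation using the PillowWriter imported earlier), you can pass a writer to anim.save:

writer = PillowWriter(fps = 50)                # 50 fps corresponds to a 20 ms frame interval
anim.save('2.9 Adamax.gif', writer = writer)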
It is to be noted that we reach the Minima in fewer than 200 iterations.
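If you want to check this numerically from the stored points (an optional check), you can measure how far the 200th stored point is from the minimum at -1/√3; it should already be very small:

abs(point_adamax[199] - (-1/np.sqrt(3)))   # distance from the minimum after 199 updates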
Adamax in multi-variable function (2 variables right now)
Everything is the same; we only have to initialize the point to (1, 0), initialize moment and weight to zeros of shape (2, 1), and replace fdash(point) with gradient(point).
But first, let us define the function, its partial derivatives, and the gradient array.
The function is f(x, y) = 2x² + 2xy + 2y² - 6x.
We know that the Minima for this function is at (2, -1), and we will start from (1, 0).
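To see why the Minima is at (2, -1), set both partial derivatives to zero and solve:

4x + 2y - 6 = 0
2x + 4y = 0

The second equation gives x = -2y; substituting into the first gives -6y - 6 = 0, so y = -1 and x = 2.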
The partial derivatives are
def f(x, y):                       # function definition
    return 2*(x**2) + 2*x*y + 2*(y**2) - 6*x

def fdash_x(x, y):                 # partial derivative w.r.t. x
    return 4*x + 2*y - 6

def fdash_y(x, y):                 # partial derivative w.r.t. y
    return 2*x + 4*y

def gradient(point):               # gradients
    return np.array([[ fdash_x(point[0][0], point[1][0]) ],
                     [ fdash_y(point[0][0], point[1][0]) ]], dtype = np.float64)
Now the steps for Adamax in 2 variables are
point = np.array([[ 1 ],            # step 1
                  [ 0 ]], dtype = np.float64)

learning_rate = 0.01

beta1 = 0.9                         # step 2
beta2 = 0.999
epsilon = 10**-8
moment = np.array([[ 0 ],
                   [ 0 ]], dtype = np.float64)
weight = np.array([[ 0 ],
                   [ 0 ]], dtype = np.float64)

for i in range(1000):                                           # step 3
    learning_rate_hat = learning_rate / (1 - beta1**(i + 1))    # step 3.1
    moment = beta1 * moment + (1 - beta1) * gradient(point)     # step 3.2
    weight = np.maximum(beta2 * weight, abs(gradient(point)))   # step 3.3
    update = - learning_rate_hat * moment / (weight + epsilon)  # step 3.4
    point += update                                             # step 3.5

point   # Minima
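As a final, optional sanity check, the result should be close to the known Minima at (2, -1), where f(2, -1) = -6:

point                               # expected to be approximately [[ 2.], [-1.]]
f(point[0][0], point[1][0])         # expected to be close to -6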
I hope now you understand how Adamax works.
With this post, all the optimizers in the course are covered. In the next post, we will create an animation in which all 9 optimizers race each other to reach the Minima.
If you are interested in studying more optimizers, then you may refer to the literature available on the internet.
Watch the video on YouTube and subscribe to the channel for more videos and posts like this.
Every slide is 3 seconds long and has no sound. You may pause the video whenever you like.
You may put on some music too if you like.
The video is basically everything in the post, only in slides.
Many thanks for your support and feedback.
If you like this course, then you can support me at
It would mean a lot to me.
Continue to the next post — 2.10 Optimizers racing to the Minima.