Adadelta — an optimizer developed to eliminate the need for a learning rate
Step-by-step implementation with animation for better understanding.
2.6 How does Adadelta work?
Adadelta was developed to eliminate the need for a learning rate. In this method, we store the squares of the gradients and of the updates, but in a restricted (exponentially decaying) manner, in two accumulators.
And we will calculate the update as follows:
update = -gradient * (accumulator_update + epsilon)**0.5 / (accumulator_gradient + epsilon)**0.5
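Here, accumulator_gradient and accumulator_update hold decaying averages of the squared gradients and squared updates, rho is the decay rate, and epsilon is a small constant; both are set in the steps below. As a minimal sketch of one such step (the helper name adadelta_step and its signature are my own, for illustration only, not part of the original notebook):
def adadelta_step(point, grad, accumulator_gradient, accumulator_update,
                  rho = 0.95, epsilon = 10**-5):
    # decaying average of squared gradients
    accumulator_gradient = rho * accumulator_gradient + (1 - rho) * grad**2
    # the ratio of the two RMS terms acts as an adaptive step size
    update = (-grad * (accumulator_update + epsilon)**0.5
              / (accumulator_gradient + epsilon)**0.5)
    # decaying average of squared updates
    accumulator_update = rho * accumulator_update + (1 - rho) * update**2
    # move the point and return the new state
    return point + update, accumulator_gradient, accumulator_update
The rest of the post implements exactly this step, first for one variable and then for two.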
You can download the Jupyter Notebook from here.
Note — It is recommended that you have a look at the first post in this chapter.
This post is divided into 3 sections.
- Adadelta in 1 variable
- Adadelta animation for 1 variable
- Adadelta in multi-variable function
Adadelta in 1 variable
In this method, we store exponentially decaying averages of the squared gradients and squared updates in two accumulators.
The Adadelta algorithm in simple language is as follows:
Step 1 - Set starting point
Step 2 - Initiate accumulator_gradient = 0
accumulator_update = 0
and set rho = 0.95 and epsilon = 10**-5
Step 3 - Initiate loop
Step 3.1 - calculate accumulator_gradient =
rho * accumulator_gradient + (1 - rho) * gradient**2
Step 3.2 - calculate update as stated above
Step 3.3 - calculate accumulator_update =
rho * accumulator_update + (1 - rho) * update**2
Step 3.4 - add update to point
Now, before moving forward let us talk about two things:
First, what about the learning rate?
Adadelta was developed to eliminate the need for a learning rate. But, for the sake of consistency with the other optimizers in Chapter 2, which all have a learning rate, we will set learning_rate = 1 and use it in step 3.4.
So, Step 3.4 becomes: add (update * learning_rate) to point, with learning_rate = 1.
Second, you must have noticed that when we enter the loop for the first time, accumulator_update is 0, so without epsilon the update would be 0 as well. To make sure the optimizer can make progress, we add epsilon to accumulator_update.
We can say that epsilon is used here to kick-start the Adadelta optimizer.
Note — You can use different values of epsilon if you want; the convergence rate will depend on the magnitude of epsilon.
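To illustrate that note with concrete numbers, here is a small experiment of my own (not part of the original notebook). It uses the 1-variable example defined below, where the gradient at the starting point x = -1 is fdash(-1) = -2, and compares the size of the very first update for two values of epsilon:
rho = 0.95
grad = -2.0                                       # fdash(-1) for the example below

for epsilon in (10**-5, 10**-2):
    accumulator_gradient = (1 - rho) * grad**2    # first pass of step 3.1
    first_update = (-grad * (0 + epsilon)**0.5
                    / (accumulator_gradient + epsilon)**0.5)
    print(epsilon, first_update)                  # ~0.014 for 10**-5, ~0.44 for 10**-2
The larger epsilon gives a noticeably larger first step, which is why the convergence rate depends on its magnitude.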
So, the Adadelta algorithm in simple language is as follows:
Step 1 - Set starting point and set learning_rate = 1
Step 2 - Initiate accumulator_gradient = 0
accumulator_update = 0
and set rho = 0.95 and epsilon = 10**-5
Step 3 - Initiate loop
Step 3.1 - calculate accumulator_gradient =
rho * accumulator_gradient + (1 - rho) * gradient**2
Step 3.2 - calculate update as stated above
Step 3.3 - calculate accumulator_update =
rho * accumulator_update + (1 - rho) * update**2
Step 3.4 - add (update * learning_rate) to point
First, let us define the function and its derivative. We will start from x = -1.
import numpy as np
np.random.seed(42)

def f(x):                        # function definition
    return x - x**3

def fdash(x):                    # function derivative definition
    return 1 - 3*(x**2)
And now Adadelta
point = -1                                        # step 1

learning_rate = 1
rho = 0.95                                        # step 2
epsilon = 10**-5
accumulator_gradient = 0
accumulator_update = 0

for i in range(1000):                             # step 3
    accumulator_gradient = (rho * accumulator_gradient
                            + (1 - rho) * fdash(point)**2)               # step 3.1
    update = (-fdash(point) * (accumulator_update + epsilon)**0.5
              / (accumulator_gradient + epsilon)**0.5)                   # step 3.2
    accumulator_update = (rho * accumulator_update
                          + (1 - rho) * update**2)                       # step 3.3
    point += learning_rate * update                                      # step 3.4

point                                             # Minima
And, we have successfully implemented Adadelta in Python.
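As a quick sanity check (my addition, not part of the original notebook): the local minima of f(x) = x - x³ that we converge to is at x = -1/√3 ≈ -0.577, since fdash(x) = 1 - 3x² vanishes there, so we can compare the final point against that value.
print(point)                                      # final point reached by Adadelta
print(-1 / 3**0.5)                                # analytic minima of x - x**3, ≈ -0.5774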
Adadelta animation for better understanding
Everything is the same as what we did earlier for the animations of the previous 5 optimizers. We will create a list, store the starting point and the updated points in it, and use the iᵗʰ index value for the iᵗʰ frame of the animation.
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib.animation import PillowWriter

point_adadelta = [-1]                             # initiating list with
                                                  # starting point in it

point = -1                                        # step 1

learning_rate = 1
rho = 0.95                                        # step 2
epsilon = 10**-5
accumulator_gradient = 0
accumulator_update = 0

for i in range(1000):                             # step 3
    accumulator_gradient = (rho * accumulator_gradient
                            + (1 - rho) * fdash(point)**2)               # step 3.1
    update = (-fdash(point) * (accumulator_update + epsilon)**0.5
              / (accumulator_gradient + epsilon)**0.5)                   # step 3.2
    accumulator_update = (rho * accumulator_update
                          + (1 - rho) * update**2)                       # step 3.3
    point += learning_rate * update                                      # step 3.4
    point_adadelta.append(point)                  # adding updated point to the list

point                                             # Minima
Next, we will configure the figure for the animation. You can change these settings if you want something different.
plt.rcParams.update({'font.size': 22})

fig = plt.figure(dpi = 100)
fig.set_figheight(10.80)
fig.set_figwidth(19.20)

x_ = np.linspace(-5, 5, 10000)
y_ = f(x_)

ax = plt.axes()
ax.plot(x_, y_)
ax.grid(alpha = 0.5)
ax.set_xlim(-5, 5)
ax.set_ylim(-5, 5)
ax.set_xlabel('x')
ax.set_ylabel('y', rotation = 0)
ax.scatter(-1, f(-1), color = 'red')
ax.hlines(f(-0.5773502691896256), -5, 5, linestyles = 'dashed', alpha = 0.5)
ax.set_title('Adadelta, learning_rate = 1')
Now we will animate the Adadelta optimizer.
def animate(i):
    ax.clear()
    ax.plot(x_, y_)
    ax.grid(alpha = 0.5)
    ax.set_xlim(-5, 5)
    ax.set_ylim(-5, 5)
    ax.set_xlabel('x')
    ax.set_ylabel('y', rotation = 0)
    ax.hlines(f(-0.5773502691896256), -5, 5, linestyles = 'dashed', alpha = 0.5)
    ax.set_title('Adadelta, learning_rate = 1')
    ax.scatter(point_adadelta[i], f(point_adadelta[i]), color = 'red')
The last line in the code snippet above uses the iᵗʰ index value from the list for the iᵗʰ frame of the animation.
anim = animation.FuncAnimation(fig, animate, frames = 200, interval = 20)
anim.save('2.6 Adadelta.gif')
We are creating an animation with only 200 frames, at a frame interval of 20 ms, i.e., 50 fps.
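A side note: the PillowWriter import above is never used explicitly, because anim.save picks an available writer on its own for the .gif file. If you want to control the frame rate of the gif yourself, you could (optionally) pass the imported PillowWriter:
anim.save('2.6 Adadelta.gif', writer = PillowWriter(fps = 50))    # 50 fps matches the 20 ms interval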
It is to be noted that we reach the minima in fewer than 200 iterations.
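If you would like to verify that from the point_adadelta list we built above, here is a small check (my addition); it prints the first iteration at which the point is within 0.001 of the minima:
minima = -1 / 3**0.5                              # analytic minima of x - x**3
for i, p in enumerate(point_adadelta):
    if abs(p - minima) < 10**-3:
        print('within 0.001 of the minima at iteration', i)
        break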
Adadelta in multi-variable function (2 variables right now)
Everything is the same; we only have to initialize point at (1, 0), initialize accumulator_gradient and accumulator_update as zero arrays of shape (2, 1), and replace fdash(point) with gradient(point).
But first, let us define the function, its partial derivatives, and the gradient array. The function is
f(x, y) = 2x² + 2xy + 2y² - 6x
We know that the Minima for this function is at (2, -1), and we will start from (1, 0).
The partial derivatives are
∂f/∂x = 4x + 2y - 6
∂f/∂y = 2x + 4y
def f(x, y):                                      # function definition
    return 2*(x**2) + 2*x*y + 2*(y**2) - 6*x

def fdash_x(x, y):                                # partial derivative w.r.t. x
    return 4*x + 2*y - 6

def fdash_y(x, y):                                # partial derivative w.r.t. y
    return 2*x + 4*y

def gradient(point):                              # gradient array
    return np.array([[ fdash_x(point[0][0], point[1][0]) ],
                     [ fdash_y(point[0][0], point[1][0]) ]], dtype = np.float64)
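As an optional sanity check (my addition), the gradient should be the zero vector at the Minima (2, -1):
print(gradient(np.array([[ 2 ],
                         [ -1 ]], dtype = np.float64)))          # expect [[0.], [0.]]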
Now the steps for Adadelta in 2 variables are
point = np.array([[ 1 ],                          # step 1
                  [ 0 ]], dtype = np.float64)

learning_rate = 1
rho = 0.95                                        # step 2
epsilon = 10**-5
accumulator_gradient = np.array([[ 0 ],
                                 [ 0 ]], dtype = np.float64)
accumulator_update = np.array([[ 0 ],
                               [ 0 ]], dtype = np.float64)

for i in range(1000):                             # step 3
    accumulator_gradient = (rho * accumulator_gradient
                            + (1 - rho) * gradient(point)**2)            # step 3.1
    update = (-gradient(point) * (accumulator_update + epsilon)**0.5
              / (accumulator_gradient + epsilon)**0.5)                   # step 3.2
    accumulator_update = (rho * accumulator_update
                          + (1 - rho) * update**2)                       # step 3.3
    point += learning_rate * update                                      # step 3.4

point                                             # Minima
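To confirm the result (again, my addition rather than part of the original notebook), print the final point and the gradient there; the point should be close to (2, -1) and the gradient close to zero.
print(point)                                      # expect values close to [[ 2.], [-1.]]
print(gradient(point))                            # expect values close to [[ 0.], [ 0.]]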
I hope you now understand how Adadelta works.
Watch the video on YouTube and subscribe to the channel for videos and posts like this.
Every slide is 3 seconds long and without sound, so you may pause the video whenever you like. You may put on some music too if you like.
The video is basically everything in the post, only in slides.