Lesson 5

Advanced Optimization: Understanding and Implementing ADAM

Introduction to ADAM

Hello! Today, we will explore the ADAM (Adaptive Moment Estimation) algorithm. This advanced optimization algorithm is a favorite among machine learning practitioners as it combines the advantages of two other extensions of Stochastic Gradient Descent (SGD): Root Mean Square Propagation (RMSprop) and Adaptive Gradient Algorithm (AdaGrad). Our primary focus today is understanding ADAM, and we will also build it from scratch in Python to optimize multivariable functions.

Understanding ADAM

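Before we dive into ADAM, let us recall that classic gradient descent methods like SGD, and even more sophisticated variants like Momentum and RMSProp, have some limitations. Plain SGD is sensitive to the choice of learning rate, can stall where gradients become very small, and applies one global learning rate to every parameter. Momentum smooths the update direction but still uses a single learning rate, while RMSProp in its basic form adapts the learning rate per parameter but does not smooth the update direction.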

ADAM combines these ideas. It maintains a per-parameter learning rate that is adapted using a running average of recent squared gradients (similar to RMSProp), and it steps along a running average of recent gradients (like Momentum). This mechanism enables the algorithm to traverse low-gradient regions quickly and to slow down near optimal points.

ADAM Mathematically

For ADAM, we modify the update rule of SGD, introducing two additional hyperparameters, beta1 and beta2. The hyperparameter beta1 controls the exponential decay rate for the first-moment estimates (similar to Momentum), while beta2 controls the exponential decay rate for the second-moment estimates (similar to RMSProp). The mathematical expression can be formulated as follows:

$$m_t = \beta_1 * m_{t-1} + (1 - \beta_1) * grad$$

$$v_t = \beta_2 * v_{t-1} + (1 - \beta_2) * grad^2$$

$$w = w - \alpha * \frac{m_t}{\sqrt{v_t} + \epsilon}$$

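Here, m_t and v_t are estimates of the gradients' first moment (the mean) and second moment (the uncentered variance), respectively, grad is the gradient, w is the parameter being optimized, and alpha is the learning rate. The squaring, division, and square root are applied element-wise, so each parameter gets its own effective step size. We also use an epsilon constant to maintain numerical stability and prevent division by zero, as in RMSProp.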
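To make the update rule concrete, here is a minimal sketch of a single step computed with NumPy. The gradient and hyperparameter values are illustrative assumptions chosen for this example, and the moment estimates are assumed to start at zero.

```python
import numpy as np

# Illustrative values (assumptions chosen for this sketch)
beta1, beta2 = 0.9, 0.9999
alpha, epsilon = 0.02, 1e-8
grad = np.array([6.0, 8.0])  # example gradient for two parameters

# Moment estimates start at zero
m_prev = np.zeros(2)
v_prev = np.zeros(2)

# The three formulas above, applied element-wise
m_t = beta1 * m_prev + (1 - beta1) * grad
v_t = beta2 * v_prev + (1 - beta2) * grad ** 2
step = alpha * m_t / (np.sqrt(v_t) + epsilon)

print("m_t: ", m_t)
print("v_t: ", v_t)
print("step:", step)
```

Both components of step come out at roughly 0.2 even though the two gradients differ, because each first-moment estimate is divided by the square root of its own second-moment estimate.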

ADAM in Python Code

Let's now consolidate the ADAM concept into Python code. We will define an ADAM function that takes the decay rates beta1 and beta2, a numerical constant epsilon, the gradients, the previous estimates of m and v (initialized to 0), and the learning rate as input. It returns the updates to subtract from the parameters, along with the updated m and v.

```python
import numpy as np

def ADAM(beta1, beta2, epsilon, grad, m_prev, v_prev, learning_rate):
    # Update biased first-moment estimate
    m = beta1 * m_prev + (1 - beta1) * grad

    # Update biased second raw moment estimate
    v = beta2 * v_prev + (1 - beta2) * np.power(grad, 2)

    # Calculate updates (the step to subtract from the parameters)
    updates = learning_rate * m / (np.sqrt(v) + epsilon)
    return updates, m, v
```

Because m and v are initialized to zero, they are biased toward zero during the first steps of the optimization, especially when beta1 and beta2 are close to 1 (that is, when the moving averages decay slowly).
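The size of this bias is easy to see in a small sketch. Assuming a constant gradient of 1.0 (a made-up value, so the true mean is exactly 1.0), the first-moment estimate after t steps equals 1 - beta1^t, which starts far below 1.0:

```python
beta1 = 0.9
grad = 1.0  # constant gradient (assumption), so the true mean is exactly 1.0
m = 0.0     # first-moment estimate initialized to zero

history = []
for t in range(1, 11):
    m = beta1 * m + (1 - beta1) * grad
    history.append(m)
    # Compare the running estimate with the closed form 1 - beta1^t
    print(f"step {t:2d}: m = {m:.4f}, 1 - beta1^t = {1 - beta1 ** t:.4f}")
```

After the first step the estimate is only 0.1, and after ten steps it has reached about 0.65, still well short of the true mean.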

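To counteract these biases, ADAM usually also includes the correction terms m_hat and v_hat. These terms scale m and v up by an amount that shrinks as the number of time steps increases. The lines below replace the update calculation at the end of the function, which then also needs the current epoch as an input: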

```python
# epoch is the 0-based iteration index, so the step count is epoch + 1
m_hat = m / (1 - np.power(beta1, epoch + 1))  # Correcting the bias for the first moment
v_hat = v / (1 - np.power(beta2, epoch + 1))  # Correcting the bias for the second moment

updates = learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
return updates, m, v
```

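Note that we still return plain m and v: the corrected values are used only to compute the current update, while the next step's moving averages are built from the uncorrected estimates.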
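Put together, a bias-corrected version might look like the sketch below. The function name adam_bias_corrected and the extra epoch parameter are assumptions made for this illustration, not part of the lesson's original function.

```python
import numpy as np

def adam_bias_corrected(beta1, beta2, epsilon, grad, m_prev, v_prev, learning_rate, epoch):
    # Hypothetical variant of the lesson's ADAM function with an added epoch argument
    # Biased moment estimates, computed as in the uncorrected version
    m = beta1 * m_prev + (1 - beta1) * grad
    v = beta2 * v_prev + (1 - beta2) * np.power(grad, 2)

    # Bias correction; epoch is 0-based, so the step count is epoch + 1
    m_hat = m / (1 - np.power(beta1, epoch + 1))
    v_hat = v / (1 - np.power(beta2, epoch + 1))

    updates = learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    # Return the uncorrected m and v so the running averages stay intact
    return updates, m, v

# First step from zero-initialized moments with an example gradient
grad = np.array([6.0, 8.0])
updates, m, v = adam_bias_corrected(
    0.9, 0.9999, 1e-8, grad, np.zeros(2), np.zeros(2), 0.02, epoch=0
)
print(updates)
```

On the very first step the corrected estimates equal the gradient and its square, so each parameter moves by roughly learning_rate (about 0.02 here) regardless of the gradient's size.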

Application of ADAM on Multivariable Function Optimization

Now, let's test ADAM by finding the minimum of the multivariable function f(x, y) = x^2 + y^2. The corresponding partial derivatives are df/dx = 2*x and df/dy = 2*y. We start at (x, y) = (3, 4), choose beta1=0.9, beta2=0.9999, epsilon=1e-8, and learning_rate=0.02, and run the optimizer for epochs 0 through 150.

```python
import numpy as np

# Uses the ADAM function (without bias correction) defined earlier

def f(x, y):
    return x ** 2 + y ** 2

def df(x, y):
    # Gradient of f: [df/dx, df/dy]
    return np.array([2 * x, 2 * y])

coordinates = np.array([3.0, 4.0])
learning_rate = 0.02
beta1 = 0.9
beta2 = 0.9999
epsilon = 1e-8
max_epochs = 150

# Moment estimates start at zero
m_prev = np.array([0.0, 0.0])
v_prev = np.array([0.0, 0.0])

for epoch in range(max_epochs + 1):
    grad = df(coordinates[0], coordinates[1])
    updates, m_prev, v_prev = ADAM(beta1, beta2, epsilon, grad, m_prev, v_prev, learning_rate)
    coordinates -= updates
    if epoch % 30 == 0:
        print(f"Epoch {epoch}, current state: {coordinates}")
```

The output of this code is the following:

```
Epoch 0, current state: [2.80000003 3.80000002]
Epoch 30, current state: [ 0.27175946 -0.35494334]
Epoch 60, current state: [-0.07373187 -0.06706317]
Epoch 90, current state: [-0.02001478  0.0301726 ]
Epoch 120, current state: [ 0.00082782 -0.0039881 ]
Epoch 150, current state: [ 0.00094425 -0.00038352]
```

ADAM vs Others

The ADAM (Adaptive Moment Estimation) optimizer often converges in fewer iterations than plain SGD (Stochastic Gradient Descent), and it adds a momentum-style first-moment estimate on top of the per-parameter scaling that RMSprop provides.

Overall, while the actual efficiency of ADAM compared to other optimizing algorithms can depend on the specific task or dataset, it often performs well in terms of both speed and accuracy across a variety of tasks.
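One way to get a feel for such comparisons is to run two optimizers side by side on the same function. The sketch below pits plain gradient descent against the uncorrected ADAM update on f(x, y) = x^2 + y^2. The helper name adam_step and the choice to give both methods the same learning rate are assumptions made for this illustration.

```python
import numpy as np

def df(x, y):
    # Gradient of f(x, y) = x^2 + y^2
    return np.array([2 * x, 2 * y])

def adam_step(grad, m_prev, v_prev, learning_rate=0.02,
              beta1=0.9, beta2=0.9999, epsilon=1e-8):
    # Hypothetical helper: ADAM update without bias correction
    m = beta1 * m_prev + (1 - beta1) * grad
    v = beta2 * v_prev + (1 - beta2) * grad ** 2
    return learning_rate * m / (np.sqrt(v) + epsilon), m, v

learning_rate = 0.02
gd_point = np.array([3.0, 4.0])    # plain gradient descent iterate
adam_point = np.array([3.0, 4.0])  # ADAM iterate
m, v = np.zeros(2), np.zeros(2)

for epoch in range(151):
    # Plain gradient descent: step against the raw gradient
    gd_point -= learning_rate * df(*gd_point)

    # ADAM: step along the scaled first-moment estimate
    updates, m, v = adam_step(df(*adam_point), m, v, learning_rate)
    adam_point -= updates

print("Plain gradient descent:", gd_point)
print("ADAM (no bias correction):", adam_point)
```

A simple, well-conditioned bowl like this one is easy for both methods, so the result should not be read as a general ranking. Differences tend to matter more on functions whose parameters have very different gradient scales.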


Congratulations! You've now understood ADAM and how to code it in Python. With its sound mathematical foundations and impressive empirical results, ADAM constitutes an excellent stepping-stone into the fascinating world of machine learning optimization.

Remember, practice solidifies understanding, so be sure to attempt the upcoming hands-on exercises to reinforce these new concepts. Until next time, happy coding!
