Lesson 5

Hello! Today, we will explore the **ADAM** (Adaptive Moment Estimation) algorithm. This advanced optimization algorithm is a favorite among machine learning practitioners as it combines the advantages of two other extensions of Stochastic Gradient Descent (SGD): Root Mean Square Propagation (RMSprop) and Adaptive Gradient Algorithm (AdaGrad). Our primary focus today is understanding ADAM, and we will also build it from scratch in Python to optimize multivariable functions.

Before we dive into ADAM, let us recall that classic gradient descent methods like `SGD`, and even more sophisticated versions like `Momentum` and `RMSProp`, have some limitations. These limitations relate to sensitivity to learning rates, the issue of vanishing gradients, and, for `SGD` and `Momentum`, the absence of individual adaptive learning rates for different parameters.

ADAM, a promising choice for an optimization algorithm, combines the merits of `RMSProp` and `AdaGrad`. It maintains a per-parameter learning rate adapted based on the average of recent magnitudes of the gradients for the weights (similar to RMSProp) and the average of recent gradients (like Momentum). This mechanism enables the algorithm to traverse quickly over the low gradient regions and slow down near the optimal points.
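To make the per-parameter scaling concrete, here is a minimal sketch (not the lesson's implementation) that compares a plain gradient step with a step divided by the gradient's own magnitude. The gradient values and the names `plain_step` and `scaled_step` are assumptions chosen for illustration, and a single gradient stands in for the running averages that ADAM keeps.

```python
import numpy as np

# Hypothetical gradients for two parameters with very different scales
grad = np.array([100.0, 0.001])
learning_rate = 0.02
epsilon = 1e-8  # small constant that avoids division by zero

# Plain gradient descent: the step is proportional to the raw gradient
plain_step = learning_rate * grad

# Per-parameter scaling: each gradient is divided by its own magnitude,
# a one-step stand-in for the running average of squared gradients
scaled_step = learning_rate * grad / (np.sqrt(grad ** 2) + epsilon)

print("plain: ", plain_step)
print("scaled:", scaled_step)
```

The plain steps differ by five orders of magnitude, while the scaled steps are both close to the learning rate, which is why a parameter with tiny gradients can still make progress.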

For `ADAM`, we modify the update rule of SGD, introducing two additional hyperparameters, `beta1` and `beta2`. The hyperparameter `beta1` controls the exponential decay rate for the first-moment estimates (similar to Momentum), while `beta2` controls the exponential decay rate for the second-moment estimates (similar to RMSProp). The mathematical expression can be formulated as follows:

$$
m_t = \beta_1 \, m_{t-1} + (1 - \beta_1) \, \text{grad}
$$

$$
v_t = \beta_2 \, v_{t-1} + (1 - \beta_2) \, \text{grad}^2
$$

$$
\theta_t = \theta_{t-1} - \alpha \, \frac{m_t}{\sqrt{v_t} + \epsilon}
$$

Here, `m_t` and `v_t` are estimates of the gradients' first moment (the mean) and the second moment (the uncentered variance), respectively, while `grad` represents the gradient, $\alpha$ is the learning rate, and $\theta$ denotes the parameters being optimized. We also use an epsilon constant to maintain numerical stability and prevent division by zero, as in RMSProp.

Let's now consolidate the `ADAM` concept into Python code. We will define an `ADAM` function, which takes the gradients, the decay rates `beta1` and `beta2`, a numerical constant `epsilon`, the learning rate, and previous estimates of `m` and `v` (initialized to `0`) as input and returns the updates to subtract from the parameters, along with the updated `m` and `v`.

```python
import numpy as np

def ADAM(beta1, beta2, epsilon, grad, m_prev, v_prev, learning_rate):
    # Update biased first-moment estimate
    m = beta1 * m_prev + (1 - beta1) * grad

    # Update biased second raw moment estimate
    v = beta2 * v_prev + (1 - beta2) * np.power(grad, 2)

    # Calculate updates
    updates = learning_rate * m / (np.sqrt(v) + epsilon)
    return updates, m, v
```
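As a quick sanity check, the sketch below calls the function once from zero-initialized moments. The input gradient `[6.0, 8.0]` is an illustrative assumption (it is the gradient of `x^2 + y^2` at `(3, 4)`), and the function body is repeated only so that the snippet runs on its own.

```python
import numpy as np

def ADAM(beta1, beta2, epsilon, grad, m_prev, v_prev, learning_rate):
    # Same update as in the lesson, repeated so this snippet is self-contained
    m = beta1 * m_prev + (1 - beta1) * grad
    v = beta2 * v_prev + (1 - beta2) * np.power(grad, 2)
    updates = learning_rate * m / (np.sqrt(v) + epsilon)
    return updates, m, v

# Illustrative input: one gradient, moments starting at zero
grad = np.array([6.0, 8.0])
updates, m, v = ADAM(0.9, 0.9999, 1e-8, grad, np.zeros(2), np.zeros(2), 0.02)

print("m:      ", m)        # (1 - beta1) * grad
print("v:      ", v)        # (1 - beta2) * grad**2
print("updates:", updates)  # roughly 0.2 in both coordinates
```

The first update is about ten times the learning rate: `m` is shrunk by a factor of `0.1`, but `sqrt(v)` is shrunk by a factor of `0.01`, so their ratio is inflated on the first step.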

`v` and `m` are initialized with zeros, and therefore they are biased towards zero at the start of the optimization, especially when `beta1` and `beta2` are close to 1, because the moving averages then move away from their initial value only slowly.
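A quick way to see this bias is to feed a constant gradient into the first-moment update and watch how slowly `m` approaches it. The sketch below assumes a hypothetical constant gradient of `1.0`, so an unbiased estimate would equal `1.0` from the very first step; the `history` list is an illustrative helper.

```python
import numpy as np

beta1 = 0.9
grad = 1.0  # hypothetical constant gradient, so the true mean is 1.0
m = 0.0     # zero initialization, as in the lesson

history = []
for t in range(1, 11):
    # Same first-moment update as in ADAM
    m = beta1 * m + (1 - beta1) * grad
    history.append(m)

print(np.round(history, 4))
```

For a constant unit gradient, `m` after `t` steps equals `1 - beta1**t`, so after ten steps it has only reached about 0.65.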

To counteract these biases, ADAM also usually includes the correction terms `m_hat` and `v_hat`. These terms adjust `m` and `v` by an amount that lessens as the number of time steps increases:

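```python
# These lines replace the final "Calculate updates" step inside ADAM;
# epoch is the zero-based loop counter from the calling code
m_hat = m / (1 - np.power(beta1, epoch + 1))  # Correcting the bias for the first moment
v_hat = v / (1 - np.power(beta2, epoch + 1))  # Correcting the bias for the second moment

updates = learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
return updates, m, v
```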

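Note that we still return plain `m` and `v`: the corrected values are used only to compute the update, while the uncorrected moving averages carry over to the next step.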
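Putting the pieces together, one way to package the bias-corrected step is sketched below. The function name `adam_step`, its default arguments, and the explicit one-based step counter `t` (equal to `epoch + 1` in the lesson's notation) are assumptions of this sketch rather than part of the lesson's code.

```python
import numpy as np

def adam_step(grad, m_prev, v_prev, t, learning_rate=0.02,
              beta1=0.9, beta2=0.9999, epsilon=1e-8):
    """One bias-corrected ADAM step; t is the 1-based step count."""
    # Biased moving averages of the gradient and the squared gradient
    m = beta1 * m_prev + (1 - beta1) * grad
    v = beta2 * v_prev + (1 - beta2) * np.power(grad, 2)

    # Bias correction: the divisors approach 1 as t grows
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    updates = learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    # Return the uncorrected m and v so they carry over to the next step
    return updates, m, v

# Illustrative first step from zero-initialized moments
grad = np.array([6.0, 8.0])
updates, m, v = adam_step(grad, np.zeros(2), np.zeros(2), t=1)
print(updates)  # close to learning_rate in each coordinate
```

With the correction in place, the first step has roughly the size of the learning rate in every coordinate, regardless of the gradient's scale.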

Now, let's test `ADAM` by finding the minimum of a multivariable function `f(x, y) = x^2 + y^2`. The corresponding gradients are `df/dx = 2*x` and `df/dy = 2*y`. With a starting point of `(x, y) = (3, 4)`, the hyperparameter values `beta1=0.9`, `beta2=0.9999`, `epsilon=1e-8`, `learning_rate=0.02`, and `150` epochs, we can start minimizing our function.

```python
# Uses numpy (imported as np) and the ADAM function defined earlier
def f(x, y):
    return x ** 2 + y ** 2

def df(x, y):
    return np.array([2 * x, 2 * y])

coordinates = np.array([3.0, 4.0])
learning_rate = 0.02
beta1 = 0.9
beta2 = 0.9999
epsilon = 1e-8
max_epochs = 150

# Moment estimates start at zero (float arrays, matching the gradient)
m_prev = np.zeros(2)
v_prev = np.zeros(2)

# range(max_epochs + 1) so that the state at epoch 150 is also printed
for epoch in range(max_epochs + 1):
    grad = df(coordinates[0], coordinates[1])
    updates, m_prev, v_prev = ADAM(beta1, beta2, epsilon, grad, m_prev, v_prev, learning_rate)
    coordinates -= updates
    if epoch % 30 == 0:
        print(f"Epoch {epoch}, current state: {coordinates}")
```

With the `ADAM` function as first defined (without the bias correction), the output of this code is the following:

```
Epoch 0, current state: [2.80000003 3.80000002]
Epoch 30, current state: [ 0.27175946 -0.35494334]
Epoch 60, current state: [-0.07373187 -0.06706317]
Epoch 90, current state: [-0.02001478 0.0301726 ]
Epoch 120, current state: [ 0.00082782 -0.0039881 ]
Epoch 150, current state: [ 0.00094425 -0.00038352]
```

In practice, ADAM often converges faster than other optimization algorithms such as SGD or RMSprop. Its actual efficiency depends on the specific task or dataset, but it tends to perform well in terms of both speed and accuracy across a variety of tasks.

Congratulations! You've now understood ADAM and how to code it in Python. With its sound mathematical foundations and impressive empirical results, ADAM constitutes an excellent stepping-stone into the fascinating world of machine learning optimization.

Remember, practice solidifies understanding. Attempt the upcoming hands-on exercises to reinforce these new concepts. Until next time, happy coding!