Hello! Today, we will explore the ADAM (Adaptive Moment Estimation) algorithm. This advanced optimization algorithm is a favorite among machine learning practitioners as it combines the advantages of two other extensions of Stochastic Gradient Descent (SGD): Root Mean Square Propagation (RMSprop) and Adaptive Gradient Algorithm (AdaGrad). Our primary focus today is understanding ADAM, and we will also build it from scratch in Python to optimize multivariable functions.
Before we dive into ADAM, let us recall that classic gradient descent methods like SGD, and even more sophisticated variants such as Momentum and RMSProp, have some limitations. These limitations relate to sensitivity to the learning rate, the issue of vanishing gradients, and (in the case of SGD and Momentum) the absence of individual adaptive learning rates for different parameters.
ADAM, a promising choice for an optimization algorithm, combines the merits of RMSProp and Momentum. It maintains a per-parameter learning rate adapted based on the average of recent magnitudes of the gradients for the weights (similar to RMSProp) and the average of the recent gradients themselves (like Momentum). This mechanism enables the algorithm to traverse quickly over low-gradient regions and slow down near the optimal points.
For ADAM, we modify the update rule of SGD, introducing two additional hyperparameters, beta1 and beta2. The hyperparameter beta1 controls the exponential decay rate for the first-moment estimates (similar to Momentum), while beta2 controls the exponential decay rate for the second-moment estimates (similar to RMSProp). The mathematical expression can be formulated as follows:
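m_t = beta1 * m_{t-1} + (1 - beta1) * grad
v_t = beta2 * v_{t-1} + (1 - beta2) * grad^2
params_t = params_{t-1} - learning_rate * m_t / (sqrt(v_t) + epsilon)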
Here, m_t and v_t are estimates of the gradient's first moment (the mean) and second moment (the uncentered variance), respectively, while grad represents the gradient. We also use a small epsilon constant to maintain numerical stability and prevent division by zero, as in RMSProp.
Let's now consolidate the ADAM concept into Python code. We will define an ADAM function, which takes the gradients, the decay rates beta1 and beta2, a small numerical constant epsilon, the learning rate, and the previous estimates of m and v (initialized to 0) as input, and returns the parameter updates (to be subtracted from the current parameters), along with the updated m and v.
Python
import numpy as np

def ADAM(beta1, beta2, epsilon, grad, m_prev, v_prev, learning_rate):
    # Update biased first-moment estimate
    m = beta1 * m_prev + (1 - beta1) * grad

    # Update biased second raw moment estimate
    v = beta2 * v_prev + (1 - beta2) * np.power(grad, 2)

    # Calculate updates
    updates = learning_rate * m / (np.sqrt(v) + epsilon)
    return updates, m, v
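As a quick sanity check, here is how a single update step could look for one parameter starting at 3.0 with gradient 2 * 3.0 = 6.0, using the same hyperparameter values as the experiment further below (the one-element arrays are purely for illustration):

Python
# One ADAM step for a single parameter x = 3.0 with gradient 2 * x = 6.0
param = np.array([3.0])
grad = np.array([6.0])
m, v = np.zeros(1), np.zeros(1)

updates, m, v = ADAM(beta1=0.9, beta2=0.9999, epsilon=1e-8,
                     grad=grad, m_prev=m, v_prev=v, learning_rate=0.02)
param -= updates
print(param)  # [2.80000003] -- matches the x-coordinate at Epoch 0 in the run below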
Both m and v are initialized with zeros, and therefore they are biased towards zero at the start of the optimization, especially when beta1 and beta2 are close to 1 (that is, when the moving averages decay slowly).
To counteract these biases, ADAM also usually includes the bias-corrected estimates m_hat and v_hat. These divide m and v by a correction factor (1 - beta^t) that approaches 1 as the number of time steps t increases, so the adjustment fades away over time:
Python
m_hat = m / (1 - np.power(beta1, epoch + 1))  # Correcting the bias for the first moment
v_hat = v / (1 - np.power(beta2, epoch + 1))  # Correcting the bias for the second moment

updates = learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
return updates, m, v
Note that we still return the plain m and v: these running averages (not their bias-corrected versions) are what must be carried over into the next update step.
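Putting the two pieces together, a bias-corrected version of the function could look like the sketch below. The name ADAM_corrected and the extra epoch argument (the current zero-based timestep) are illustrative choices, not fixed conventions:

Python
def ADAM_corrected(beta1, beta2, epsilon, grad, m_prev, v_prev, learning_rate, epoch):
    # Biased moment estimates, exactly as before
    m = beta1 * m_prev + (1 - beta1) * grad
    v = beta2 * v_prev + (1 - beta2) * np.power(grad, 2)

    # Bias-corrected estimates; the correction factor approaches 1 as epoch grows
    m_hat = m / (1 - np.power(beta1, epoch + 1))
    v_hat = v / (1 - np.power(beta2, epoch + 1))

    updates = learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    # Return the uncorrected m and v so the running averages carry over
    return updates, m, v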
Now, let's put ADAM to the test by finding the minimum of the multivariable function f(x, y) = x^2 + y^2. The corresponding gradients are df/dx = 2*x and df/dy = 2*y. With an initial starting point at (x, y) = (3, 4), reasonable values of beta1=0.9, beta2=0.9999, epsilon=1e-8, and learning_rate=0.02, and 150 epochs, we can start minimizing our function.
Python
def f(x, y):
    return x ** 2 + y ** 2

def df(x, y):
    return np.array([2 * x, 2 * y])

coordinates = np.array([3.0, 4.0])
learning_rate = 0.02
beta1 = 0.9
beta2 = 0.9999
epsilon = 1e-8
max_epochs = 150

m_prev = np.array([0, 0])
v_prev = np.array([0, 0])

for epoch in range(max_epochs + 1):
    grad = df(coordinates[0], coordinates[1])
    updates, m_prev, v_prev = ADAM(beta1, beta2, epsilon, grad, m_prev, v_prev, learning_rate)
    coordinates -= updates
    if epoch % 30 == 0:
        print(f"Epoch {epoch}, current state: {coordinates}")
The output of this code is the following:
Epoch 0, current state: [2.80000003 3.80000002]
Epoch 30, current state: [ 0.27175946 -0.35494334]
Epoch 60, current state: [-0.07373187 -0.06706317]
Epoch 90, current state: [-0.02001478  0.0301726 ]
Epoch 120, current state: [ 0.00082782 -0.0039881 ]
Epoch 150, current state: [ 0.00094425 -0.00038352]
The ADAM (Adaptive Moment Estimation) optimizer is generally more efficient than many other optimization algorithms, such as SGD (Stochastic Gradient Descent) or RMSProp, in the sense that it often needs less tuning and fewer iterations to reach a good solution. That said, its actual advantage depends on the specific task and dataset; overall, ADAM tends to perform well in terms of both speed and accuracy across a wide variety of tasks.
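If you want to compare the optimizers yourself on this toy problem, here is a minimal sketch of plain SGD reusing f, df, and the hyperparameters defined above. Keep in mind that both methods converge easily on such a simple convex bowl, so results here won't necessarily reflect behavior on larger, noisier problems:

Python
# Plain SGD on the same function, for comparison
coordinates_sgd = np.array([3.0, 4.0])
for epoch in range(max_epochs + 1):
    coordinates_sgd -= learning_rate * df(coordinates_sgd[0], coordinates_sgd[1])
print(f"SGD final state: {coordinates_sgd}")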
Congratulations! You've now understood ADAM and how to code it in Python. With its sound mathematical foundations and impressive empirical results, ADAM constitutes an excellent stepping-stone into the fascinating world of machine learning optimization.
Remember, practice solidifies comprehension and consolidates understanding, so be sure to attempt the upcoming hands-on exercises to reinforce these new concepts. Until next time, happy coding!