Lesson 3
Accelerating Convergence: Implementing Momentum in Gradient Descent Algorithms
Getting Started with Momentum

Hello! Today, we will learn about a powerful technique that makes our Gradient Descent move faster, like a ball rolling down a hill. We call this "Momentum".

What's Momentum and How It Works

Momentum improves our Gradient Descent. How does it do that? Remember how a ball on top of a hill starts rolling down? If the slope is steep, the ball picks up speed, right? That's what momentum does to our Gradient Descent. It remembers part of the previous step and adds it to the current one, so the steps get bigger when the slope (our 'hill') keeps pointing in the same direction over time.

How to Add Momentum to Gradient Descent

Let's get down to coding! We will use a gradient function, grad_func(). The weight or parameter (theta) starts at a point and moves down the slope by adjusting itself in every iteration or 'epoch'. With momentum, each adjustment follows this update rule:

$$v := v \cdot \gamma + \alpha \cdot \text{gradient}$$

$$\theta := \theta - v$$

Where:

  • $\theta$ is the parameter vector,
  • $\text{gradient}$ is the gradient of the cost function with respect to the parameters, evaluated at the current parameter value,
  • $\alpha$ is the learning rate,
  • $v$ is the velocity vector (initialized to 0), and
  • $\gamma$ is the momentum parameter (a new hyperparameter).

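A higher $\gamma$ keeps more of the previous velocity, which usually speeds up convergence. Values around 0.9 are a common choice, but if $\gamma$ is too close to 1 the parameter can overshoot the minimum and oscillate around it.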
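To get a feel for what the momentum parameter does, here is a minimal sketch that assumes the gradient stays constant (a simplification made only for illustration; real gradients change as theta moves). The helper name velocity_after is hypothetical. Under that assumption, the velocity settles at learning_rate * gradient / (1 - gamma):

```python
def velocity_after(steps, gamma, learning_rate, gradient):
    """Apply v := gamma * v + learning_rate * gradient repeatedly, starting from v = 0."""
    v = 0.0
    for _ in range(steps):
        v = gamma * v + learning_rate * gradient
    return v


learning_rate = 0.01
gradient = 2.0  # held constant purely for illustration

for gamma in (0.0, 0.5, 0.9):
    v = velocity_after(200, gamma, learning_rate, gradient)
    limit = learning_rate * gradient / (1 - gamma)
    print(f"gamma={gamma}: velocity after 200 steps = {v:.4f}, limit = {limit:.4f}")
```

With gamma = 0.9 the step size ends up about ten times larger than with gamma = 0 (no momentum), which is where the speed boost comes from when the slope keeps pointing the same way.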

Here is the Python implementation:

Python
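gradient = grad_func(theta)               # slope at the current parameter value
v = gamma * v + learning_rate * gradient  # keep part of the old velocity, add the new step
theta = theta - v                         # move the parameter by the velocity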

We compute the gradient at the current parameters. Then, we calculate the new velocity: the old velocity scaled by gamma, plus the learning rate times the gradient. Finally, we update our parameter by subtracting this velocity from it.
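To watch the velocity build up, the three lines above can be wrapped in a small function and called a few times. This is only a sketch: the helper name momentum_step is hypothetical, and grad_func is assumed to be the gradient of x**2, the example function used later in this lesson.

```python
def grad_func(x):
    # Gradient of the example cost x**2 (assumed here for illustration)
    return 2 * x


def momentum_step(theta, v, gamma=0.9, learning_rate=0.01):
    """Perform one momentum update and return the new (theta, v)."""
    gradient = grad_func(theta)
    v = gamma * v + learning_rate * gradient
    theta = theta - v
    return theta, v


theta, v = 4.0, 0.0  # start on the slope with zero velocity
for step in range(3):
    theta, v = momentum_step(theta, v)
    print(f"step {step + 1}: theta={theta:.4f}, v={v:.4f}")
```

Over these three steps the velocity grows from 0.08 to roughly 0.21 even though the gradient is slowly shrinking, because each step carries over 90% of the previous one.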

Compare Gradient Descents: Setup

Now let's see how momentum aids in faster convergence (which means getting to the answer quicker). The following code snippet runs both methods side by side so we can plot them afterwards:

Python
import matplotlib.pyplot as plt
import numpy as np

def func(x):
    # Cost function: a simple parabola with its minimum at x = 0
    return x**2

def grad_func(x):
    # Derivative of x**2
    return 2*x

gamma = 0.9           # momentum parameter
learning_rate = 0.01
v = 0                 # velocity, starts at zero
epochs = 50

# Both methods start from the same point
theta_plain = 4.0
theta_momentum = 4.0

history_plain = []
history_momentum = []

for _ in range(epochs):
    # Plain gradient descent
    history_plain.append(theta_plain)
    gradient = grad_func(theta_plain)
    theta_plain = theta_plain - learning_rate * gradient

    # Gradient descent with momentum
    history_momentum.append(theta_momentum)
    gradient = grad_func(theta_momentum)
    v = gamma * v + learning_rate * gradient
    theta_momentum = theta_momentum - v

Here, we run plain and momentum-based gradient descent within one loop and record the parameter value at every epoch so we can visualize both later.

Compare Gradient Descents: Visualization

Let's visualize the comparison:

Python
plt.figure(figsize=(12, 7))
plt.plot([func(theta) for theta in history_plain], label='Gradient Descent')
plt.plot([func(theta) for theta in history_momentum], label='Momentum-based Gradient Descent')
plt.xlabel('Epoch')
plt.ylabel('Cost')
plt.legend()
plt.grid()
plt.show()

Here is the result:

[Figure: Cost versus Epoch for Gradient Descent and Momentum-based Gradient Descent]

Here, we compare Gradient Descent (without momentum) and Momentum-based Gradient Descent on the same function (x^2). The graph shows how the cost (value of the function) changes over time (or epochs). The cost gets smaller faster for the Momentum-based method. That's because it gets a speed boost from the momentum, just like the ball rolling down the hill!
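The same comparison can be checked with numbers instead of a plot. The sketch below reuses the lesson's settings (start at 4.0, learning rate 0.01, 50 epochs) and reports the cost after the last update; the helper name final_cost is hypothetical, and gamma = 0.0 is used to stand in for plain Gradient Descent, since the velocity then equals just the current step.

```python
def grad_func(x):
    # Gradient of the example cost x**2
    return 2 * x


def final_cost(gamma, learning_rate=0.01, epochs=50, theta=4.0):
    """Run momentum gradient descent on x**2 and return the cost after `epochs` updates.

    gamma = 0.0 reduces the update to plain gradient descent.
    """
    v = 0.0
    for _ in range(epochs):
        v = gamma * v + learning_rate * grad_func(theta)
        theta = theta - v
    return theta ** 2


for gamma in (0.0, 0.5, 0.9):
    print(f"gamma={gamma}: cost after 50 epochs = {final_cost(gamma):.4f}")
```

On this function and with these settings, both momentum runs finish with a lower cost than the plain run. That holds for this example only; on other functions, or with gamma very close to 1, momentum can overshoot and take longer to settle.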

Wrapping Up

You've done it! You've understood how to use momentum to improve Gradient Descent and seen it in action. Doesn't the ball-on-a-hill analogy make it easier to understand? Now, it's time to put your knowledge into practice! If you remember how a rolling ball picks up speed, you'll never forget how momentum improves Gradient Descent. Happy practicing and coding!
