Lesson 3
Gradient Descent with Momentum
Lesson Introduction

Welcome back! Today, we'll explore Gradient Descent with Momentum. You've already familiarized yourself with basic gradient descent. However, sometimes gradient descent is slow and gets stuck, especially on bumpy paths to the minimum.

So, how do we speed it up? We use momentum. Imagine pushing a heavy shopping cart. Instead of stopping and starting, you build momentum. This helps you move faster. By the end of this lesson, you'll understand how gradient descent with momentum works, implement it in Python, and see how it improves optimization.

Introducing Momentum

Momentum in optimization is like a push in physical movement. It reduces oscillations (back-and-forth movements) as you head toward the minimum point. Essentially, it helps you move down faster and more steadily.

Let's look at the formulas for momentum:

v_t = \beta v_{t-1} - \alpha \nabla f(\theta_{t-1})
\theta_t = \theta_{t-1} + v_t

Here's what these terms mean:

  • v_t: Velocity at time step t. It shows speed and direction.
  • \theta: This is the common symbol for the parameters of the function we're optimizing, similar to variables like x and y in a function f(x, y).
  • \beta: Momentum term, usually between 0 and 1. Higher values mean more momentum.
  • \alpha: Learning rate, controlling step size.
  • \nabla f(\theta_{t-1}): Gradient of function f at the current point \theta_{t-1}.
  • \theta_t: Updated parameter at time step t, bringing us closer to the minimum.
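
To see these formulas in action, here's a small worked example (the specific numbers are illustrative and not part of the lesson's code). Take the one-dimensional function f(\theta) = \theta^2, whose gradient is \nabla f(\theta) = 2\theta, and assume \theta_0 = 2, \beta = 0.9, \alpha = 0.1, and v_0 = 0:

v_1 = 0.9 \cdot 0 - 0.1 \cdot 4 = -0.4, \quad \theta_1 = 2 + (-0.4) = 1.6
v_2 = 0.9 \cdot (-0.4) - 0.1 \cdot 3.2 = -0.68, \quad \theta_2 = 1.6 + (-0.68) = 0.92

Notice that the second step is larger than the first: the velocity carries over 90% of the previous step, so consistent gradients keep building up speed.
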
How Velocity Works in Gradient Descent with Momentum

In basic gradient descent, the update to the parameters is directly proportional to the gradient of the function. However, this approach can be slow due to oscillations around the minimum.

With momentum, the velocity term v_t is introduced to accelerate gradient vectors in the right directions. Here's a detailed breakdown of how the velocity term works:

  1. Initial Update (at t=0)
    • Velocity starts at zero: v_0 = 0.
    • The first update is similar to basic gradient descent.
  2. Subsequent Updates (at t>0)
    • The velocity is updated with a fraction of the previous velocity, v_{t-1}, scaled by \beta, and the current gradient, scaled by \alpha.
    • This means that the update direction is influenced not just by the current gradient but also by past gradients, providing a smoothing effect.
    • The point is updated using this velocity.

In essence, the velocity term accumulates the gradients of past steps to create a more stable and faster approach toward the minimum.
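
To watch this accumulation happen, here's a minimal one-dimensional sketch (it reuses the illustrative values from the worked example above; it is not the lesson's full implementation, which follows next):

Python
# Trace velocity for f(theta) = theta^2, whose gradient is 2 * theta
beta = 0.9      # momentum term
alpha = 0.1     # learning rate
theta = 2.0     # starting point
velocity = 0.0  # v_0 = 0

for t in range(1, 4):
    grad = 2 * theta                           # gradient at the current point
    velocity = beta * velocity - alpha * grad  # keep a fraction of the old velocity, subtract the scaled gradient
    theta += velocity                          # move the point using the velocity
    print(f"t={t}: velocity={velocity:.3f}, theta={theta:.3f}")

Each printed velocity carries 90% of the previous one, which is exactly the smoothing effect described above.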

Python Implementation: part 1

Now, let's implement Gradient Descent with Momentum in Python. Here's the code snippet:

Python
# Gradient descent with momentum
def gradient_descent_with_momentum(f_grad, init_point, learning_rate=0.1, momentum=0.9, iterations=100):
    point = list(init_point)
    velocity = [0] * len(point)
    for _ in range(iterations):
        grad = f_grad(point)
        for i in range(len(point)):
            velocity[i] = momentum * velocity[i] - learning_rate * grad[i]
            point[i] += velocity[i]
    return point

Here’s a breakdown of the key lines in the code:

  • velocity = [0] * len(point): Initializes the velocity vector with zeros, having the same length as the starting point.
  • velocity[i] = momentum * velocity[i] - learning_rate * grad[i]: Updates the velocity by applying the momentum and subtracting the gradient scaled by the learning rate.
  • point[i] += velocity[i]: Updates the current point using the newly calculated velocity.
Python Implementation: part 2

Here's the continuation of our implementation with the example function and initial point:

Python
# Example function: f(x, y) = x^2 + y^2
def quadratic_gradient(point):
    x, y = point
    return [2 * x, 2 * y]

# Initial point
init_point = [2, 2]

# Find the optimal point using gradient descent with momentum
optimal_point_momentum = gradient_descent_with_momentum(quadratic_gradient, init_point, learning_rate=0.1, iterations=100)
print("Optimal point after gradient descent with momentum:", optimal_point_momentum)
# Optimal point after gradient descent with momentum: [~0, ~0]
Benefits of Using Momentum

Using momentum in gradient descent offers several benefits:

  1. Faster Convergence: Reaches the minimum quicker.
  2. Reduced Oscillations: Smooths the path, reducing back-and-forth movements.
  3. Better Navigation Through Local Minima: Helps the optimizer roll through small bumps and shallow local minima instead of getting stuck.
Visualizing Momentum

To visually understand this, let's compare the paths of basic gradient descent and gradient descent with momentum. We will take f(x, y) = x^2 + y^2 and apply both types of gradient descent with just 3 iterations:

You can clearly see how gradient descent with momentum converges faster.
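
If you'd like to reproduce this comparison yourself, here's a minimal plotting sketch. It assumes matplotlib is installed and reuses quadratic_gradient from the code above; the two track_* helpers are written just for this sketch and are not part of the lesson's implementation:

Python
import matplotlib.pyplot as plt

def track_basic_gd(f_grad, init_point, learning_rate=0.1, iterations=3):
    # Basic gradient descent, recording every visited point
    point = list(init_point)
    path = [list(point)]
    for _ in range(iterations):
        grad = f_grad(point)
        for i in range(len(point)):
            point[i] -= learning_rate * grad[i]
        path.append(list(point))
    return path

def track_momentum_gd(f_grad, init_point, learning_rate=0.1, momentum=0.9, iterations=3):
    # Gradient descent with momentum, recording every visited point
    point = list(init_point)
    velocity = [0] * len(point)
    path = [list(point)]
    for _ in range(iterations):
        grad = f_grad(point)
        for i in range(len(point)):
            velocity[i] = momentum * velocity[i] - learning_rate * grad[i]
            point[i] += velocity[i]
        path.append(list(point))
    return path

basic_path = track_basic_gd(quadratic_gradient, [2, 2])
momentum_path = track_momentum_gd(quadratic_gradient, [2, 2])

plt.plot([p[0] for p in basic_path], [p[1] for p in basic_path], "o-", label="Basic gradient descent")
plt.plot([p[0] for p in momentum_path], [p[1] for p in momentum_path], "s-", label="Gradient descent with momentum")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()

After 3 iterations, basic gradient descent is still near (1.02, 1.02), while the momentum version has already reached roughly (0.12, 0.12).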

Lesson Summary

Congratulations! You've learned about Gradient Descent with Momentum. We covered its importance, how it works, and implemented it in Python. You've seen how it speeds up optimization and reduces oscillations.

Now, let’s practice. In the practice session, you'll implement Gradient Descent with Momentum and observe its effects on different functions. Get ready to solidify your understanding and see momentum in action!
