Lesson 3

Welcome back! Today, we'll explore **Gradient Descent with Momentum**. You've already familiarized yourself with basic gradient descent. However, basic gradient descent can be slow or get stuck, especially when the path to the minimum is bumpy.

So, how do we speed it up? We use **momentum**. Imagine pushing a heavy shopping cart. Instead of stopping and starting, you build momentum. This helps you move faster. By the end of this lesson, you'll understand how gradient descent with momentum works, implement it in Python, and see how it improves optimization.

Momentum in optimization is like a push in physical movement. It reduces oscillations (back-and-forth movements) as you head toward the minimum point. Essentially, it helps you move down faster and more steadily.

Let's look at the formulas for momentum:

$$v_t = \beta v_{t-1} - \alpha \nabla f(\theta_{t-1})$$

$$\theta_t = \theta_{t-1} + v_t$$

Here's what these terms mean:

- $v_t$: Velocity at time step $t$. It shows speed and direction.
- $\theta$: This is the common symbol for the parameters of the function we're optimizing, similar to variables like $x$ and $y$ in a function $f(x, y)$.
- $\beta$: Momentum term, usually between 0 and 1. Higher values mean more momentum.
- $\alpha$: Learning rate, controlling step size.
- $\nabla f(\theta_{t-1})$: Gradient of function $f$ at the current point.
- $\theta_t$: Updated parameter at time step $t$, bringing us closer to the minimum.

In basic gradient descent, the parameter update is directly proportional to the gradient of the function. However, this approach can be slow because the iterates oscillate around the minimum instead of heading straight toward it.
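For reference, basic gradient descent can be sketched as follows; the function name and defaults here are illustrative, chosen to mirror the momentum implementation later in the lesson:

```python
# Sketch of basic gradient descent for comparison: no velocity term,
# each step depends only on the current gradient.
def gradient_descent(f_grad, init_point, learning_rate=0.1, iterations=100):
    point = list(init_point)
    for _ in range(iterations):
        grad = f_grad(point)
        for i in range(len(point)):
            point[i] -= learning_rate * grad[i]  # step along the negative gradient
    return point

# Example: minimize f(x, y) = x^2 + y^2, whose gradient is [2x, 2y]
result = gradient_descent(lambda p: [2 * p[0], 2 * p[1]], [2, 2])
print(result)
```

Because each update looks only at the current gradient, there is no mechanism to dampen oscillations; that is exactly what the velocity term adds.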

With momentum, the velocity term $v_t$ is introduced to accelerate gradient vectors in the right directions. Here's a detailed breakdown of how the velocity term works:

**Initial Update (at $t=0$)**

- Velocity starts at zero: $v_0 = 0$.
- The first update therefore reduces to a plain gradient-descent step.

**Subsequent Updates (at $t>0$)**

- The velocity is updated with a fraction of the previous velocity, $v_{t-1}$, scaled by $\beta$, and the current gradient, scaled by $\alpha$.
- This means that the update direction is influenced not just by the current gradient but also by the past gradients, providing a smoothing effect.
- The point is updated using this velocity.

In essence, the velocity term accumulates the gradients of past steps to create a more stable and faster approach toward the minimum.

Now, let's implement **Gradient Descent with Momentum** in Python. Here's the code snippet:

```python
# Gradient descent with momentum
def gradient_descent_with_momentum(f_grad, init_point, learning_rate=0.1, momentum=0.9, iterations=100):
    point = list(init_point)
    velocity = [0] * len(point)
    for _ in range(iterations):
        grad = f_grad(point)
        for i in range(len(point)):
            velocity[i] = momentum * velocity[i] - learning_rate * grad[i]
            point[i] += velocity[i]
    return point
```

Here’s a breakdown of the key lines in the code:

- `velocity = [0] * len(point)`: Initializes the velocity vector with zeros, with the same length as the starting point.
- `velocity[i] = momentum * velocity[i] - learning_rate * grad[i]`: Updates the velocity by applying momentum to the previous velocity and subtracting the gradient scaled by the learning rate.
- `point[i] += velocity[i]`: Updates the current point using the newly calculated velocity.

Here's the continuation of our implementation with the example function and initial point:

```python
# Example function: f(x, y) = x^2 + y^2
def quadratic_gradient(point):
    x, y = point
    return [2 * x, 2 * y]

# Initial point
init_point = [2, 2]

# Find the optimal point using gradient descent with momentum
optimal_point_momentum = gradient_descent_with_momentum(quadratic_gradient, init_point, learning_rate=0.1, iterations=100)
print("Optimal point after gradient descent with momentum:", optimal_point_momentum)
# Optimal point after gradient descent with momentum: [~0, ~0]
```

Using momentum in gradient descent offers several benefits:

- **Faster Convergence**: Reaches the minimum more quickly.
- **Reduced Oscillations**: Smooths the path, reducing back-and-forth movements.
- **Better Navigation Through Local Minima**: Avoids getting stuck in small bumps along the way.

To see this visually, let's compare the paths of basic gradient descent and gradient descent with momentum. We will take $f(x, y) = x^2 + y^2$ and run both methods for just `3` iterations:
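The comparison can be sketched in code as well. The trajectory-tracking helper below is an illustrative addition (not part of the lesson's implementation): with `momentum=0.0` it behaves like basic gradient descent, and with `momentum=0.9` it matches the momentum version above.

```python
# Compare plain gradient descent vs. momentum on f(x, y) = x^2 + y^2.
# This helper records the full trajectory; names are illustrative.
def quadratic_gradient(point):
    x, y = point
    return [2 * x, 2 * y]

def descend(f_grad, init_point, learning_rate=0.1, momentum=0.0, iterations=3):
    point = list(init_point)
    velocity = [0] * len(point)
    path = [list(point)]
    for _ in range(iterations):
        grad = f_grad(point)
        for i in range(len(point)):
            velocity[i] = momentum * velocity[i] - learning_rate * grad[i]
            point[i] += velocity[i]
        path.append(list(point))
    return path

plain = descend(quadratic_gradient, [2, 2], momentum=0.0)       # basic gradient descent
with_momentum = descend(quadratic_gradient, [2, 2], momentum=0.9)

for step, (p, m) in enumerate(zip(plain, with_momentum)):
    print(f"step {step}: plain={p}, momentum={m}")
```

After three steps the momentum run is much closer to the origin than the plain run, because each step reuses the velocity built up on the previous steps.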

You can clearly see that gradient descent with momentum converges faster.

Congratulations! You've learned about **Gradient Descent with Momentum**. We covered its importance, how it works, and implemented it in Python. You've seen how it speeds up optimization and reduces oscillations.

Now, let’s practice. In the practice session, you'll implement Gradient Descent with Momentum and observe its effects on different functions. Get ready to solidify your understanding and see momentum in action!