Lesson 3

Hello! Today, we will learn about a powerful technique that makes our Gradient Descent move faster, like a ball rolling down a hill. We call this "Momentum".

Momentum improves our `Gradient Descent`. How does it do that? Remember how a ball on top of a hill starts rolling down? If the slope is steep, the ball picks up speed, right? That's what momentum does to our `Gradient Descent`. It makes it move faster when the slope (our 'hill') points in the same direction over time.

Let's get down to coding! We'll demonstrate the effect of momentum on gradient descent using a gradient function, `grad_func()`. The weight or parameter (`theta`) starts at a point and moves down the slope by adjusting itself in every iteration, or 'epoch', according to this update rule:

$v := v \cdot \gamma + \alpha \cdot gradient$

$\theta := \theta - v$

Where:

- $\theta$ is the parameter vector,
- $gradient$ is the gradient of the cost function with respect to the parameters, evaluated at the current parameter value,
- $\alpha$ is the learning rate,
- $v$ is the velocity vector (initialized to 0), and
- $\gamma$ is the momentum parameter (a new hyperparameter).
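
To see why the velocity acts like a speed boost, it can help to unroll the update rule. This is a small worked sketch, assuming $v$ starts at 0 and writing $g_k$ (a label introduced here, not part of the lesson's notation) for the gradient at step $k$:

$v_t = \alpha \sum_{k=1}^{t} \gamma^{t-k} g_k$

So the velocity is a weighted sum of all past gradients, with older gradients fading by a factor of $\gamma$ per step. If the gradient stayed at a constant value $g$ and $0 \le \gamma < 1$, the sum would become a geometric series:

$v_t = \alpha \cdot g \cdot \frac{1 - \gamma^t}{1 - \gamma} \to \frac{\alpha \cdot g}{1 - \gamma}$

With $\gamma = 0.9$, for example, the step can grow to as much as 10 times the plain gradient descent step $\alpha \cdot g$. When gradients keep changing sign instead, the terms partly cancel and the velocity stays small.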

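A higher $\gamma$ gives more weight to past gradients, which often speeds up convergence when the gradient keeps pointing in the same direction. However, $\gamma$ should stay below 1, and values very close to 1 can make the parameter overshoot and oscillate around the minimum. A common starting value is 0.9.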
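
The effect of $\gamma$ can be checked with a quick experiment. The sketch below is only an illustration: it assumes the simple cost $x^2$ with gradient $2x$, a learning rate of 0.01, and a starting point of 4.0, and the helper `steps_to_converge` and its tolerance are hypothetical choices made for this example.

```python
def grad_func(x):
    # Gradient of the illustrative cost f(x) = x**2
    return 2 * x


def steps_to_converge(gamma, learning_rate=0.01, theta=4.0, tol=1e-3, max_steps=10_000):
    """Count momentum updates until both theta and the velocity are below tol."""
    v = 0.0
    for step in range(1, max_steps + 1):
        v = gamma * v + learning_rate * grad_func(theta)
        theta = theta - v
        # Require a small velocity too, so a mere zero crossing does not count
        if abs(theta) < tol and abs(v) < tol:
            return step
    return max_steps  # did not converge within the step budget


for gamma in (0.0, 0.5, 0.9, 0.99):
    print(f"gamma = {gamma}: {steps_to_converge(gamma)} steps")
```

On this particular problem, `gamma = 0.9` needs fewer steps than no momentum at all (`gamma = 0.0`), while `gamma = 0.99` needs more, because the parameter keeps swinging back and forth across the minimum. The best value depends on the problem, so it is usually tuned like any other hyperparameter.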

Here is the python implementation:

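```python
# One momentum update step; assumes theta, v, gamma, learning_rate,
# and grad_func are already defined
gradient = grad_func(theta)
v = gamma * v + learning_rate * gradient
theta = theta - v
```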

First, we compute the gradient at the current parameters. Then, we calculate the new velocity `v`: the old velocity scaled by `gamma`, plus the learning rate times the gradient. Finally, we update our parameter by subtracting this velocity from it.
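
To try the update on its own, the three lines above can be wrapped in a small helper. This is a minimal sketch: `momentum_step` is a hypothetical name, and the cost $x^2$ (gradient $2x$), the starting point 4.0, and the hyperparameter values are assumptions chosen for illustration.

```python
def grad_func(x):
    # Gradient of f(x) = x**2, assumed here only for illustration
    return 2 * x


def momentum_step(theta, v, gamma=0.9, learning_rate=0.01):
    """Apply one momentum update and return the new (theta, v)."""
    gradient = grad_func(theta)
    v = gamma * v + learning_rate * gradient
    theta = theta - v
    return theta, v


theta, v = 4.0, 0.0  # starting point and zero initial velocity
for epoch in range(3):
    theta, v = momentum_step(theta, v)
    print(f"epoch {epoch + 1}: theta = {theta:.4f}, v = {v:.4f}")
```

Over these three steps the velocity grows from 0.08 to roughly 0.21, even though the gradient itself is shrinking slightly. That growing step size is the ball picking up speed.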

Now let's visualize how momentum aids in faster convergence (which means getting to the answer quicker) in the following code snippet:

```python
import matplotlib.pyplot as plt


def func(x):
    # Cost function: f(x) = x^2
    return x**2


def grad_func(x):
    # Gradient of the cost function: f'(x) = 2x
    return 2 * x


gamma = 0.9           # momentum parameter
learning_rate = 0.01
v = 0.0               # velocity, initialized to 0
epochs = 50

# Both methods start from the same point
theta_plain = 4.0
theta_momentum = 4.0

history_plain = []
history_momentum = []

for _ in range(epochs):
    # Plain gradient descent
    history_plain.append(theta_plain)
    gradient = grad_func(theta_plain)
    theta_plain = theta_plain - learning_rate * gradient

    # Momentum-based gradient descent
    history_momentum.append(theta_momentum)
    gradient = grad_func(theta_momentum)
    v = gamma * v + learning_rate * gradient
    theta_momentum = theta_momentum - v
```

Here, we run plain and momentum-based gradient descent within one loop and record the parameter value at every epoch so we can visualize both runs later.

Let's visualize the comparison:

```python
plt.figure(figsize=(12, 7))
plt.plot([func(theta) for theta in history_plain], label='Gradient Descent')
plt.plot([func(theta) for theta in history_momentum], label='Momentum-based Gradient Descent')
plt.xlabel('Epoch')
plt.ylabel('Cost')
plt.legend()
plt.grid()
plt.show()
```

The resulting plot compares Gradient Descent (without momentum) and Momentum-based Gradient Descent on the same function ($x^2$). It shows how the cost (the value of the function) changes over the epochs. The cost drops faster for the Momentum-based method. That's because it gets a speed boost from the momentum, just like the ball rolling down the hill!
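
The same comparison can also be checked with numbers instead of a plot. The sketch below reruns the lesson's loop with the same settings and prints the final cost of each method; the variable `first_overshoot` is a hypothetical addition that records when the momentum run first passes the minimum at 0.

```python
def func(x):
    return x**2


def grad_func(x):
    return 2 * x


gamma, learning_rate, epochs = 0.9, 0.01, 50

theta_plain = theta_momentum = 4.0
v = 0.0
first_overshoot = None  # first epoch at which the momentum run goes below 0

for epoch in range(1, epochs + 1):
    # Plain gradient descent update
    theta_plain -= learning_rate * grad_func(theta_plain)

    # Momentum-based update
    v = gamma * v + learning_rate * grad_func(theta_momentum)
    theta_momentum -= v
    if first_overshoot is None and theta_momentum < 0:
        first_overshoot = epoch

print(f"final cost, plain:    {func(theta_plain):.4f}")
print(f"final cost, momentum: {func(theta_momentum):.4f}")
print(f"momentum first passes the minimum at epoch: {first_overshoot}")
```

With these settings the momentum run ends at a much lower cost, but it also rolls past the minimum before settling, much like a ball that reaches the bottom of a valley with speed to spare.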

You've done it! You've understood how to use momentum to improve Gradient Descent and seen it in action. Doesn't the ball-on-a-hill analogy make it easier to understand? Now, it's time to put your knowledge into practice! If you remember how a rolling ball picks up speed, you'll never forget how momentum improves Gradient Descent. Happy practicing and coding!