Understanding and Implementing RMSProp in Python

Lesson 4

Introduction to RMSProp

Hello! Today, we will dive into RMSProp (Root Mean Square Propagation). This sophisticated optimization algorithm accelerates convergence by adapting the learning rate for each weight separately, addressing the limitations of previous techniques such as Stochastic Gradient Descent (SGD), Mini-Batch Gradient Descent, and momentum. Our focus today is understanding RMSProp and coding it from scratch in Python to optimize multivariable functions.

Recap on Gradient Descent Techniques

Let's begin with a quick recap: SGD and Mini-Batch Gradient Descent can be sensitive to the learning rate and may converge slowly. Even momentum, which mitigates these issues to an extent, has limitations. When a single learning rate is shared by all parameters, it has to be small enough for the steepest direction, which slows progress along the shallower ones. This is where RMSProp steps in to offer a solution.
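To make this limitation concrete, here is a minimal sketch of plain gradient descent with one shared learning rate. The function f(x, y) = x^2 + 100*y^2, the starting point, the learning rate, and the helper names are all assumptions chosen for illustration.

```python
import numpy as np

def grad_f(point):
    # Gradient of the assumed function f(x, y) = x**2 + 100 * y**2
    x, y = point
    return np.array([2.0 * x, 200.0 * y])

def plain_gradient_descent(start, learning_rate, steps):
    # Every coordinate shares the same learning rate
    point = np.array(start, dtype=float)
    for _ in range(steps):
        point = point - learning_rate * grad_f(point)
    return point

# Each step multiplies y by (1 - 200 * learning_rate), so stability along y
# requires learning_rate < 0.01; we pick 0.009
final_point = plain_gradient_descent([5.0, 4.0], learning_rate=0.009, steps=100)
print(final_point)
```

After 100 steps the steep coordinate y is essentially zero, while x has only moved from 5 to roughly 0.8, because the learning rate that keeps y stable is too small for the shallow x direction.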

Understanding RMSProp

RMSProp adjusts the gradient descent step for each weight individually, which can speed up training and lead to faster convergence. It does this by keeping a running average of the squared gradients for each weight and dividing that weight's update by the square root of this average.

RMSProp Mathematically

For RMSProp, we add another layer to the update rule of SGD. This additional layer scales each update by the inverse of the square root of an exponentially weighted running average of recent squared gradients. Here, gradients measure the magnitude and direction of change for the weights. The mathematical expression is:

$$ s_{dw} = \rho \, s_{dw} + (1 - \rho) \, dw^{2} $$

$$ w = w - \alpha \frac{dw}{\sqrt{s_{dw}} + \epsilon} $$

The first equation computes the running average of the squared gradients ( $dw$ ). The term $\rho$ is a hyperparameter (generally set to 0.9) called the "decay rate", which controls how strongly previous gradients influence the current update. The name comes from the fact that, as the number of iterations grows, the weight given to the squared gradients of earlier iterations shrinks exponentially. Hence, more recent gradients have more impact on the update.
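The exponential down-weighting can be checked numerically. The sketch below uses a made-up gradient sequence (an assumption for illustration), runs the recurrence, and compares the result with an explicit weighted sum in which a squared gradient from k steps ago receives the weight (1 - rho) * rho**k.

```python
import numpy as np

rho = 0.9
# Hypothetical gradient history for a single weight, oldest first
grads = np.array([3.0, -1.0, 0.5, 2.0, -4.0])

# Running average computed with the recurrence, starting from s = 0
s = 0.0
for g in grads:
    s = rho * s + (1 - rho) * g**2

# Same quantity as an explicit weighted sum: the most recent gradient
# has age 0, the oldest has age len(grads) - 1
ages = np.arange(len(grads))[::-1]
weights = (1 - rho) * rho**ages
s_explicit = np.sum(weights * grads**2)

print(weights)        # weights grow toward the most recent gradient
print(s, s_explicit)  # the two values agree up to rounding
```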

The second equation is the update rule for the weight $w$, where $\alpha$ is the learning rate. Dividing by $\sqrt{s_{dw}}$ scales down the effective learning rate for a weight whose recent gradients have been large, so that the learning process isn't too aggressive and is less likely to overshoot minima in the loss landscape.
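As a rough illustration of this scaling, the sketch below feeds two weights constant gradients of very different size through the update rule. The gradient values and hyperparameters are hypothetical, and the variable names simply mirror the symbols in the equations.

```python
import numpy as np

alpha, rho, epsilon = 0.1, 0.9, 1e-6
# Hypothetical constant gradients: one large, one small
dw = np.array([100.0, 0.01])
s_dw = np.zeros_like(dw)

for _ in range(50):
    s_dw = rho * s_dw + (1 - rho) * dw**2
    rmsprop_step = alpha * dw / (np.sqrt(s_dw) + epsilon)

sgd_step = alpha * dw  # plain SGD step with the same learning rate

print(sgd_step)      # differs by four orders of magnitude between the weights
print(rmsprop_step)  # both entries end up close to alpha
```

The effective learning rate alpha / (sqrt(s_dw) + epsilon) is small for the weight with the large gradient and large for the weight with the small one, so both weights end up taking steps of similar size.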

Note that the denominator in the second formula adds a small constant ( $\epsilon$ ) to the square root of the running average of squared gradients ( $s_{dw}$ ). This avoids division by zero and keeps the update numerically stable when $s_{dw}$ is very small.
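The role of this constant can be seen with a weight whose gradient is exactly zero on the first step, so that its running average is also zero. This is a small sketch with assumed hyperparameter and gradient values.

```python
import numpy as np

alpha, rho, epsilon = 0.1, 0.9, 1e-6
dw = np.array([0.0, 2.0])   # first weight has a zero gradient
s_dw = rho * np.zeros(2) + (1 - rho) * dw**2

# Without epsilon the first entry would be 0 / 0, which NumPy evaluates to nan
with np.errstate(divide="ignore", invalid="ignore"):
    step_without_eps = alpha * dw / np.sqrt(s_dw)

# With epsilon the denominator is never zero, so the step is well defined
step_with_eps = alpha * dw / (np.sqrt(s_dw) + epsilon)

print(step_without_eps)
print(step_with_eps)
```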

RMSProp in Python Code

Let's now encapsulate the RMSProp concept into Python code. We will define an RMSProp function that takes the learning rate, the decay rate $\rho$ , a small number $\epsilon$ , the gradient, and the previous running average of squared gradients (initialized to 0) as inputs. It returns the update to subtract from the parameters together with the new running average.

Python
import numpy as np

def RMSProp(learning_rate, rho, epsilon, grad, s_prev):
    # Update the running average of squared gradients
    s = rho * s_prev + (1 - rho) * np.power(grad, 2)

    # Scale each step by the root of its running average
    updates = learning_rate * grad / (np.sqrt(s) + epsilon)
    return updates, s

Application of RMSProp on Multivariable Function Optimization

Now let's apply RMSProp to find the minimum of the multivariable function f(x, y) = x^2 + y^2. The corresponding gradients are df/dx = 2*x and df/dy = 2*y. We set the starting point to (x, y) = (5, 4) and pick common hyperparameter choices (rho = 0.9, epsilon = 1e-6, and learning_rate = 0.1). The loop runs for epochs 0 through 100, that is, 101 update steps, and prints the current state every 20 epochs.

Python
import numpy as np

def f(x, y):
    return x**2 + y**2

def df(x, y):
    # Gradient of f with respect to x and y
    return np.array([2*x, 2*y])

coordinates = np.array([5.0, 4.0])
learning_rate = 0.1
rho = 0.9
epsilon = 1e-6
max_epochs = 100

# Running average of squared gradients, one entry per coordinate
s_prev = np.zeros(2)

# Uses the RMSProp function defined earlier
for epoch in range(max_epochs + 1):
    grad = df(coordinates[0], coordinates[1])
    updates, s_prev = RMSProp(learning_rate, rho, epsilon, grad, s_prev)
    coordinates -= updates
    if epoch % 20 == 0:
        print(f"Epoch {epoch}, current state: {coordinates}")

The output of this code is as follows:


Epoch 0, current state: [4.68377233 3.68377236]
Epoch 20, current state: [2.3688824  1.47561697]
Epoch 40, current state: [0.95903672 0.35004133]
Epoch 60, current state: [0.13761293 0.00745214]
Epoch 80, current state: [3.91649374e-04 3.12725069e-09]
Epoch 100, current state: [-3.07701828e-17  2.18862195e-20]

As you can see, x and y quickly approach 0, which is indeed the minimum of the given function.

Evaluation of RMSProp Over Other Gradient Descent Techniques

Lastly, we can compare the performance of RMSProp with SGD, Mini-Batch Gradient Descent, or Momentum-based Gradient Descent by examining how efficiently each one arrives at the global minimum of a cost function. For a simple two-variable function like the one in our example, where both coordinates have the same curvature, RMSProp's advantage over these methods is not very pronounced. Its benefits show more clearly in complex, large-scale machine learning tasks, where gradient magnitudes can differ widely across parameters.

It reduces oscillations and high variance in parameter updates by dividing each gradient by the root of a moving average of its recent squared values, often leading to quicker convergence and improved stability in the learning process. This makes it particularly useful for handling complex models and large datasets in deep learning applications.

Conclusion

Well done! You now understand RMSProp and can code it in Python. As an adaptive optimization technique, RMSProp often converges faster than methods that use a single fixed learning rate, making it a useful tool in your machine learning toolbox.

Next, we will have hands-on exercises for you to practice and reinforce these new concepts. Remember, practice strengthens learning and expands understanding. Happy coding!
