Lesson 4

Hello! Today, we will dive into **RMSProp** (Root Mean Square Propagation). This sophisticated optimization algorithm accelerates convergence by adapting the learning rate for each weight separately, addressing the limitations of previous techniques such as **Stochastic Gradient Descent** (SGD), **Mini-Batch Gradient Descent**, and **momentum**. Our focus today is understanding RMSProp and coding it from scratch in Python to optimize multivariable functions.

Let's begin with a quick recap: `SGD` and `Mini-Batch Gradient Descent` can be sensitive to learning rates and may converge slowly. Even `momentum`, which mitigates these issues to an extent, has limitations. When a uniform learning rate is applied across all parameters, efficient optimization might not be achieved. This is where RMSProp steps in to offer a solution.

RMSProp, an advanced optimization algorithm, adjusts the gradient descent step for each weight individually, accelerating training and allowing faster convergence. It does this by keeping track of a running average of the squared gradients and using that average to scale the learning rate of each weight.

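For `RMSProp`, we add another layer to the update rule of `SGD`. This additional layer scales each update by the inverse of the square root of a running average of the squares of recent gradients. Here, the gradient gives the magnitude and direction of the steepest change in the loss with respect to each weight. The mathematical expression is:

$$s_{dw} = \rho \, s_{dw} + (1 - \rho) \, dw^2$$

$$w = w - \alpha \, \frac{dw}{\sqrt{s_{dw}} + \epsilon}$$

where $\alpha$ denotes the learning rate.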

The first equation represents the running average of the squared gradients ($dw$). The term $\rho$ is a hyperparameter (generally set to 0.9) called the "decay rate", which controls how strongly previous gradients influence the current update. The name comes from the fact that, as iterations accumulate, the weight given to the squared gradients of earlier iterations shrinks exponentially. Hence, more recent gradients have more impact on the update.
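To make the exponential weighting concrete, here is a small sketch that computes the running average twice: once with the recursive rule and once as an explicit weighted sum. The gradient values are made up for illustration and the variable names are not part of the lesson's code.

```python
import numpy as np

rho = 0.9  # decay rate, the value suggested in the text

# Hypothetical gradient history for one weight (illustrative values only),
# ordered from oldest to most recent.
grads = np.array([1.0, -2.0, 0.5, 3.0, -1.0])

# Running average computed recursively, starting from s = 0.
s = 0.0
for g in grads:
    s = rho * s + (1 - rho) * g**2

# The same quantity as an explicit weighted sum: the gradient from
# k steps ago receives the weight (1 - rho) * rho**k.
ages = np.arange(len(grads))[::-1]  # age of each gradient, oldest first
weights = (1 - rho) * rho**ages
s_unrolled = np.sum(weights * grads**2)

print("weights:", weights)  # increase toward the most recent gradient
print("recursive:", s, "unrolled:", s_unrolled)
```

The two results agree up to floating-point rounding, and the weights shrink by a factor of $\rho$ for every step further back in time.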

The second equation describes the update rule for the weight $w$. The effective learning rate is scaled down for weights with large gradients, so that learning isn't overly aggressive and we avoid overshooting minima in the loss landscape.
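As a rough illustration of this per-weight scaling, the sketch below compares a single plain gradient step with a single RMSProp step for two weights whose gradients differ by four orders of magnitude. The gradient values are hypothetical, and the running average is assumed to start at zero.

```python
import numpy as np

learning_rate, rho, epsilon = 0.1, 0.9, 1e-6

# Hypothetical gradients for two weights (illustrative values only).
grad = np.array([100.0, 0.01])

# Plain gradient descent: the step is proportional to the raw gradient.
sgd_step = learning_rate * grad

# RMSProp, first iteration with the running average initialized to zero.
s = rho * np.zeros_like(grad) + (1 - rho) * grad**2
rmsprop_step = learning_rate * grad / (np.sqrt(s) + epsilon)

print("SGD step:    ", sgd_step)
print("RMSProp step:", rmsprop_step)
```

On this first step, both RMSProp steps come out close to `learning_rate / sqrt(1 - rho)`, regardless of the size of the raw gradient. On later steps the sizes depend on each weight's gradient history, so they will generally differ.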

Note that the denominator in the second formula combines the square root of the running average of squared gradients ($s_{dw}$) with a small additive constant ($\epsilon$) to avoid division by zero. This constant also ensures numerical stability.
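The role of $\epsilon$ can be seen in a minimal sketch where one weight has a gradient of exactly zero on the first step. The values are hypothetical. The `np.errstate` context only silences NumPy's warnings for the deliberately unsafe division.

```python
import numpy as np

learning_rate, rho, epsilon = 0.1, 0.9, 1e-6

# Hypothetical gradients: the first weight has a zero gradient.
grad = np.array([0.0, 2.0])
s = rho * np.zeros_like(grad) + (1 - rho) * grad**2

# Without epsilon, the first entry is 0 / 0, which NumPy evaluates to nan.
with np.errstate(invalid="ignore", divide="ignore"):
    step_no_eps = learning_rate * grad / np.sqrt(s)

# With epsilon, the same entry is 0 / epsilon, which is simply 0.
step_with_eps = learning_rate * grad / (np.sqrt(s) + epsilon)

print("without epsilon:", step_no_eps)
print("with epsilon:   ", step_with_eps)
```

A `nan` in the update would propagate into the weights and every later iteration, which is why the constant is added even though it barely changes the other entries.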

Let's now encapsulate the `RMSProp` concept into Python code. We will define an `RMSProp` function, which takes the learning rate, decay factor $\rho$, a small number $\epsilon$, the gradient, and the previous running average of squared gradients (initialized to `0`) as inputs. It returns the update to subtract from the parameters and the updated running average of squared gradients.

```python
import numpy as np

def RMSProp(learning_rate, rho, epsilon, grad, s_prev):
    # Update the running average of squared gradients
    s = rho * s_prev + (1 - rho) * np.power(grad, 2)

    # Calculate the step to subtract from the parameters
    updates = learning_rate * grad / (np.sqrt(s) + epsilon)
    return updates, s
```

Now let's apply `RMSProp` to find the minimum of a multivariable function `f(x, y) = x^2 + y^2`. Corresponding gradients are `df/dx = 2*x` and `df/dy = 2*y`. We've set the initial starting point to `(x, y) = (5, 4)`, and picked common choices for hyperparameters (`rho = 0.9`, `epsilon = 1e-6`, and `learning_rate = 0.1`), running our optimizer over `100` epochs.

```python
def f(x, y):
    return x**2 + y**2

def df(x, y):
    return np.array([2*x, 2*y])

coordinates = np.array([5.0, 4.0])
learning_rate = 0.1
rho = 0.9
epsilon = 1e-6
max_epochs = 100

# Running average of squared gradients, one entry per coordinate
s_prev = np.array([0.0, 0.0])

# range(max_epochs + 1) so that epoch 100 is also reported
for epoch in range(max_epochs + 1):
    grad = df(coordinates[0], coordinates[1])
    updates, s_prev = RMSProp(learning_rate, rho, epsilon, grad, s_prev)
    coordinates -= updates
    if epoch % 20 == 0:
        print(f"Epoch {epoch}, current state: {coordinates}")
```

The output of this code is as follows:

```
Epoch 0, current state: [4.68377233 3.68377236]
Epoch 20, current state: [2.3688824 1.47561697]
Epoch 40, current state: [0.95903672 0.35004133]
Epoch 60, current state: [0.13761293 0.00745214]
Epoch 80, current state: [3.91649374e-04 3.12725069e-09]
Epoch 100, current state: [-3.07701828e-17 2.18862195e-20]
```

As you can see, `x` and `y` quickly approach `0`, which is indeed the minimum of the given function.

Lastly, we can compare the performance of `RMSProp` with SGD, Mini-Batch Gradient Descent, or Momentum-based Gradient Descent by examining how efficiently each one arrives at the global minimum of a cost function. For a simple two-variable function like the one in the example, RMSProp's advantage over these methods is not very pronounced. Its strengths show in complex, large-scale machine learning tasks, where it is known for its high efficiency.

It reduces oscillations and high variance in parameter updates by dividing each gradient by the square root of a moving average of its recent squared values, often leading to quicker convergence and improved stability in the learning process. This makes it particularly useful for handling complex models and large datasets in deep learning applications.
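As a rough sketch of such a comparison, the code below runs plain gradient descent and RMSProp on an elongated bowl, `f(x, y) = x**2 + 100 * y**2`. This function, the name `grad_f`, and the plain gradient descent step size of `0.009` are assumptions chosen for illustration; they do not come from the lesson. The RMSProp hyperparameters and starting point match the example above.

```python
import numpy as np

def grad_f(p):
    # Gradient of the hypothetical elongated bowl f(x, y) = x**2 + 100 * y**2.
    return np.array([2.0 * p[0], 200.0 * p[1]])

steps = 101  # same number of updates as the loop in the lesson

# Plain gradient descent. The step size is an assumption: it must stay
# below 2 / 200 = 0.01 here, or the steep y direction diverges.
gd_point = np.array([5.0, 4.0])
for _ in range(steps):
    gd_point = gd_point - 0.009 * grad_f(gd_point)

# RMSProp with the lesson's hyperparameters.
learning_rate, rho, epsilon = 0.1, 0.9, 1e-6
rms_point = np.array([5.0, 4.0])
s = np.zeros(2)
for _ in range(steps):
    g = grad_f(rms_point)
    s = rho * s + (1 - rho) * g**2
    rms_point = rms_point - learning_rate * g / (np.sqrt(s) + epsilon)

print("Plain gradient descent:", gd_point)
print("RMSProp:               ", rms_point)
```

In this setup, plain gradient descent is still noticeably away from zero along `x` after the same number of updates, because its single learning rate is limited by the steep `y` direction. RMSProp normalizes each coordinate separately, so the factor of 100 has little effect on it. A single toy function is only an illustration, and results on real models depend heavily on the problem and on tuning.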

Well done! You now understand RMSProp and can code it in Python. As an advanced optimization technique, RMSProp often allows for faster convergence, making it a robust tool in your machine learning toolbox.

Next, we will have hands-on exercises for you to practice and reinforce these new concepts. Remember, practice strengthens learning and expands understanding. Happy coding!