Lesson 3

Hello and welcome to another session on **"Regression and Gradient Descent."** In today's syllabus, we will construct and fit the gradient descent algorithm into a linear regression problem. Though linear regression does have a direct solution, gradient descent is essential for computational efficiency, especially when handling larger datasets or complex models.

**Gradient descent** is an iterative optimization algorithm for minimizing a function, usually a loss function, quantifying the disparity between predicted and actual results. The goal of gradient descent is to find the parameters that minimize the value of the loss function. Importantly, gradient descent navigates its way to the minimum of the function by moving iteratively toward the direction of the steepest descent. However, to leverage gradient descent, the target function must be differentiable.

Gradient descent derives its name from its working mechanism: taking *descents* along the *gradient*. It operates in several iterative steps as follows:

- Choose random values for initial parameters.
- Calculate the cost (the difference between actual and predicted value).
- Compute the gradient (the steepest slope of the function around that point).
- Update the parameters using the gradient.
- Repeat steps 2 to 4 until we reach an acceptable error rate or exhaust the maximum iterations.

A vital component of gradient descent is the learning rate, which determines the size of the descent towards the optimum solution. It is important to note that if the learning rate is too high, we may overshoot the minimum, and if it's too low, the convergence to the minimum may take too long.

Let's implement it from scratch with a basic understanding of the gradient descent algorithm. We will need two functions: one for calculating the cost and another for calculating and applying the gradient to update our parameters. Moreover, we'll add an early stop mechanism that will halt computations after a predefined number of iterations.

The cost function is as follows:

$J(X, y, \theta) = \frac{1}{m} \sum_{i=1}^m (X\cdot\theta - y_i)^2$where $J$ is the cost, $X$ is the data, $y$ is the actual values, $\theta$ is the parameters, and $m$ is the length of $y$. It is the calculation of the mean square error.

Python`1import numpy as np 2 3def cost(X, y, theta): 4 m = len(y) 5 predictions = X.dot(theta) 6 cost = (1/m) * np.sum(np.square(predictions-y)) # Compute mean square error 7 return cost`

Next, for the gradient descent function, we follow the gradient descent update rule:

$\theta := \theta - \alpha \frac{1}{m} X^T \cdot (X\cdot\theta - y)$Here, $\alpha$ is the learning rate, which determines the size of our steps in the descent. $X^T$ is the transpose of the data. Note, that it should have been multiplied by `2`

, as we take the derivative of a the mean **squared** error, but we can ignore this `2`

and just consider it as a part of the learning rate coefficient.

Python`1def gradient_descent(X, y, theta, alpha, iterations): 2 m = len(y) 3 cost_history = np.zeros(iterations) 4 theta_history = np.zeros((iterations,2)) 5 for i in range(iterations): # Iterate until convergence 6 prediction = np.dot(X,theta) # Matrix multiplication between X and theta 7 theta = theta - (1/m)*alpha*(X.T.dot((prediction - y))) # Gradient update rule 8 theta_history[i,:] = theta.T 9 cost_history[i] = cost(X,y,theta) 10 return theta, cost_history, theta_history`

Let's apply our gradient descent function to a simple linear regression problem. The form of linear regression is:

$y = ax + b$where $a$ and $b$ are the parameters, $\theta$, we need to learn. The following data have been generated based on this form with some noise.

Python`1X = 2 * np.random.rand(100,1) 2y = 4 +3 * X+np.random.randn(100,1) 3 4lr = 0.01 # Learning Rate 5n_iter = 1000 # Max number of iterations 6theta = np.random.randn(2,1) # Randomly initialized parameters 7X_b = np.c_[np.ones((len(X),1)),X] # add bias parameter to X 8theta, cost_history, theta_history = gradient_descent(X_b,y,theta,lr,n_iter) # Gradient Descent`

Congratulations! You have mastered implementing the gradient descent algorithm and its application to linear regression. We covered theoretical explanations, derived the math behind the cost function and the gradient descent update rule, and brought these concepts to life by coding in Python.

It is now time to practice and solidify what you have learned. In the upcoming exercises, challenge yourself with different problems and experiment with varying parameters like the learning rate. Enjoy your journey into the world of gradients!