Lesson 3

Gradient Descent Optimization in Linear Regression

Introduction

Hello and welcome to another session on "Regression and Gradient Descent." In today's syllabus, we will construct and fit the gradient descent algorithm into a linear regression problem. Though linear regression does have a direct solution, gradient descent is essential for computational efficiency, especially when handling larger datasets or complex models.

The Concept of Gradient Descent

Gradient descent is an iterative optimization algorithm for minimizing a function, usually a loss function, quantifying the disparity between predicted and actual results. The goal of gradient descent is to find the parameters that minimize the value of the loss function. Importantly, gradient descent navigates its way to the minimum of the function by moving iteratively toward the direction of the steepest descent. However, to leverage gradient descent, the target function must be differentiable.

Taking Steps with Gradient Descent

Gradient descent derives its name from its working mechanism: taking descents along the gradient. It operates in several iterative steps as follows:

  1. Choose random values for initial parameters.
  2. Calculate the cost (the difference between actual and predicted value).
  3. Compute the gradient (the steepest slope of the function around that point).
  4. Update the parameters using the gradient.
  5. Repeat steps 2 to 4 until we reach an acceptable error rate or exhaust the maximum iterations.

A vital component of gradient descent is the learning rate, which determines the size of the descent towards the optimum solution. It is important to note that if the learning rate is too high, we may overshoot the minimum, and if it's too low, the convergence to the minimum may take too long.

Implementing Gradient Descent in Python: The Cost Function

Let's implement it from scratch with a basic understanding of the gradient descent algorithm. We will need two functions: one for calculating the cost and another for calculating and applying the gradient to update our parameters. Moreover, we'll add an early stop mechanism that will halt computations after a predefined number of iterations.

The cost function is as follows:

J(X,y,θ)=1mi=1m(Xθyi)2J(X, y, \theta) = \frac{1}{m} \sum_{i=1}^m (X\cdot\theta - y_i)^2

where JJ is the cost, XX is the data, yy is the actual values, θ\theta is the parameters, and mm is the length of yy. It is the calculation of the mean square error.

Python
1import numpy as np 2 3def cost(X, y, theta): 4 m = len(y) 5 predictions = X.dot(theta) 6 cost = (1/m) * np.sum(np.square(predictions-y)) # Compute mean square error 7 return cost
Implementing Gradient Descent in Python: The Gradient Descent

Next, for the gradient descent function, we follow the gradient descent update rule:

θ:=θα1mXT(Xθy)\theta := \theta - \alpha \frac{1}{m} X^T \cdot (X\cdot\theta - y)

Here, α\alpha is the learning rate, which determines the size of our steps in the descent. XTX^T is the transpose of the data. Note, that it should have been multiplied by 2, as we take the derivative of a the mean squared error, but we can ignore this 2 and just consider it as a part of the learning rate coefficient.

Python
1def gradient_descent(X, y, theta, alpha, iterations): 2 m = len(y) 3 cost_history = np.zeros(iterations) 4 theta_history = np.zeros((iterations,2)) 5 for i in range(iterations): # Iterate until convergence 6 prediction = np.dot(X,theta) # Matrix multiplication between X and theta 7 theta = theta - (1/m)*alpha*(X.T.dot((prediction - y))) # Gradient update rule 8 theta_history[i,:] = theta.T 9 cost_history[i] = cost(X,y,theta) 10 return theta, cost_history, theta_history
Applying Gradient Descent to Linear Regression

Let's apply our gradient descent function to a simple linear regression problem. The form of linear regression is:

y=ax+by = ax + b

where aa and bb are the parameters, θ\theta, we need to learn. The following data have been generated based on this form with some noise.

Python
1X = 2 * np.random.rand(100,1) 2y = 4 +3 * X+np.random.randn(100,1) 3 4lr = 0.01 # Learning Rate 5n_iter = 1000 # Max number of iterations 6theta = np.random.randn(2,1) # Randomly initialized parameters 7X_b = np.c_[np.ones((len(X),1)),X] # add bias parameter to X 8theta, cost_history, theta_history = gradient_descent(X_b,y,theta,lr,n_iter) # Gradient Descent
Lesson Summary and Practice

Congratulations! You have mastered implementing the gradient descent algorithm and its application to linear regression. We covered theoretical explanations, derived the math behind the cost function and the gradient descent update rule, and brought these concepts to life by coding in Python.

It is now time to practice and solidify what you have learned. In the upcoming exercises, challenge yourself with different problems and experiment with varying parameters like the learning rate. Enjoy your journey into the world of gradients!

Enjoy this lesson? Now it's time to practice with Cosmo!

Practice is how you turn knowledge into actual skills.