Implementing Multiple Linear Regression from Scratch

Introduction

Welcome to our exciting second class in the Regression and Gradient Descent series! In the previous lesson, we covered Simple Linear Regression. Now, we're transitioning toward Multiple Linear Regression, a powerful tool for examining the relationship between a dependent variable and several independent variables.

Consider a case where we need to predict house prices, which undoubtedly depend on multiple factors, such as location, size, and the number of rooms. Multiple Linear Regression accounts for these simultaneous predictors. In today's lesson, you'll learn how to implement this concept in Python!

Multiple Linear Regression - The Concept

Multiple Linear Regression builds upon the concept of Simple Linear Regression, accounting for more than one independent variable.

Let's recall the Simple Linear Regression equation:

$y = \beta_0 + \beta_1x$

For Multiple Linear Regression, we add multiple independent variables, $x_1, x_2, \ldots, x_m$, each with its own coefficient:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_m x_m$

Linear Algebra Behind: Dataset Representation

Suppose we have $n$ data points (equations), each with $m$ features ($x$ values). Then $X$ looks like:

$\mathbf{X} = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \ldots & x_{1,m} \\ 1 & x_{2,1} & x_{2,2} & \ldots & x_{2,m} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & x_{n,2} & \ldots & x_{n,m} \end{bmatrix}$

Each row holds the $m$ features of a single data point. Notice how we include a column of 1's to represent the intercept (also called the bias) of each equation.

For each row (equation), there is a corresponding y value. So y looks like:

$\mathbf{y} = \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix}$

The Normal Equation yields a vector of coefficients, with one entry for the intercept and one for each of the $m$ features:

$\mathbf{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{m} \end{bmatrix}$

Linear Algebra Behind: Making a Prediction

Now, for any set of features ${x_{1}}$ through ${x_{m}}$ , we can predict the $\hat{y}$ value as:

$\hat{y} = (1 \cdot {β}_0) + ({β}_1 \cdot x_{1}) + ({β}_2 \cdot x_{2}) + ... + ({β}_m \cdot x_{m})$

To calculate all the predictions at once, we take the matrix product of $X$ and $\beta$:

$\hat{\mathbf{y}} = \begin{bmatrix} \hat{y}_{1} \\ \hat{y}_{2} \\ \vdots \\ \hat{y}_{n} \end{bmatrix} = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \ldots & x_{1,m} \\ 1 & x_{2,1} & x_{2,2} & \ldots & x_{2,m} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & x_{n,2} & \ldots & x_{n,m} \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{m} \end{bmatrix} = X \cdot \mathbf{\beta}$
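As a minimal sketch of the two forms above, the snippet below computes predictions one row at a time and then with a single matrix product. The feature values, the coefficients, and the names `features`, `beta_example`, and `X_design` are made up purely for illustration.

```python
import numpy as np

# Hypothetical data: n = 3 data points, m = 2 features (illustrative values only).
features = np.array([[2.0, 3.0],
                     [1.0, 0.0],
                     [4.0, 5.0]])

# Hypothetical coefficients [beta_0, beta_1, beta_2].
beta_example = np.array([1.0, 2.0, 3.0])

# Prepend the column of ones so that beta_0 acts as the intercept.
X_design = np.hstack([np.ones((features.shape[0], 1)), features])

# Row by row: y_hat_i = beta_0 + beta_1 * x_i1 + beta_2 * x_i2
y_hat_loop = np.array([
    beta_example[0] + beta_example[1] * row[0] + beta_example[2] * row[1]
    for row in features
])

# All at once: y_hat = X @ beta
y_hat_matrix = X_design @ beta_example

print(X_design.shape, beta_example.shape, y_hat_matrix.shape)  # (3, 3) (3,) (3,)
print(y_hat_matrix)  # the three predictions are 14, 3 and 24
```

Both approaches give the same numbers; the matrix form simply avoids the explicit loop and is what the implementation later in this lesson relies on.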

Linear Algebra Behind: Math Solution

To implement Multiple Linear Regression, we'll leverage some Linear Algebra concepts. Using the Normal Equation, we can calculate the coefficients for our regression equation:

$\beta = (X^T X)^{-1} X^T y$

Where $X$ is a matrix of features and $y$ is a vector of the target variable values. Like Simple Linear Regression, residuals (the differences between actual and predicted values) play a significant role. The smaller these residuals, the better the model fits.
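One way to see what the Normal Equation guarantees: it is equivalent to the condition $X^T (y - X\beta) = 0$, so the residuals are orthogonal to every column of $X$. The sketch below checks this numerically on a small randomly generated dataset. The data and the names `X_demo`, `y_demo`, and `beta_demo` are assumptions made for the illustration, not part of the lesson's dataset.

```python
import numpy as np

# Made-up dataset: 6 data points, 2 features, plus a column of ones for the intercept.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 2))
X_demo = np.hstack([np.ones((6, 1)), features])
y_demo = rng.normal(size=6)

# Normal Equation: beta = (X^T X)^{-1} X^T y
beta_demo = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# Residuals: differences between actual and predicted values.
residuals = y_demo - X_demo @ beta_demo

# At the least-squares solution, X^T @ residuals is zero up to rounding error.
orthogonal = np.allclose(X_demo.T @ residuals, 0.0, atol=1e-10)
print("Residuals orthogonal to columns of X:", orthogonal)

# Moving the coefficients away from the solution increases the squared error.
ss_best = np.sum(residuals ** 2)
ss_nudged = np.sum((y_demo - X_demo @ (beta_demo + 0.1)) ** 2)
print("Nudged coefficients fit worse:", ss_best < ss_nudged)
```

Both checks should report `True`, which is another way of saying that the Normal Equation picks the coefficients with the smallest sum of squared residuals.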

Implementing Multiple Linear Regression from Scratch

Let's roll up our sleeves and start coding! We'll primarily rely on NumPy to handle numerical operations and matrices.

First, we set up our dataset:

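```python
import numpy as np

# Feature matrix: 5 data points (rows), 3 features (columns)
X = np.array([[73, 67, 43],
              [91, 88, 64],
              [87, 134, 58],
              [102, 43, 37],
              [69, 96, 70]], dtype='float32')

# Target values, one per data point
y = np.array([56, 81, 119, 22, 103], dtype='float32')
```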

Next, we calculate our vector of coefficients, $\beta$, using the Normal Equation.

First, extend the feature matrix, $X$, with an extra column of ones to account for the intercept.

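```python
# Column of ones with one row per data point
ones = np.ones(shape=(len(X), 1))
# Place the ones in front of the features: X now has shape (5, 4)
X = np.append(ones, X, axis=1)
```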

Then compute the coefficients $\beta$ using the Normal Equation.

```python
# beta = (X^T X)^{-1} X^T y
beta = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
```

We could also use the `@` operator instead of `.dot`. Both compute the same matrix product, so choose the one you find more readable:

```python
beta = np.linalg.inv(X.T @ X) @ X.T @ y
```
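Explicitly inverting $X^T X$ works well for a small example like this one, but it can become numerically fragile when features are strongly correlated. As a hedged alternative (not the approach this lesson uses), NumPy's `np.linalg.solve` and `np.linalg.lstsq` can compute the same coefficients without forming the inverse. The sketch below rebuilds the lesson's dataset under new names (`features`, `X_aug`, `y_vals`) so that it runs on its own.

```python
import numpy as np

# Same data as in the lesson, rebuilt here so the snippet is self-contained.
features = np.array([[73, 67, 43],
                     [91, 88, 64],
                     [87, 134, 58],
                     [102, 43, 37],
                     [69, 96, 70]], dtype='float64')
y_vals = np.array([56, 81, 119, 22, 103], dtype='float64')
X_aug = np.hstack([np.ones((len(features), 1)), features])

# The lesson's approach: explicit inverse.
beta_inv = np.linalg.inv(X_aug.T @ X_aug) @ X_aug.T @ y_vals

# Alternative 1: solve the linear system (X^T X) beta = X^T y directly.
beta_solve = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_vals)

# Alternative 2: a least-squares solver that works on X itself.
beta_lstsq, *_ = np.linalg.lstsq(X_aug, y_vals, rcond=None)

# All three should agree up to small rounding differences.
print(np.allclose(beta_inv, beta_solve, rtol=1e-4, atol=1e-4))
print(np.allclose(beta_solve, beta_lstsq, rtol=1e-4, atol=1e-4))
```

For this dataset the three results should match closely; the difference between the methods only starts to matter on larger or poorly conditioned problems.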

Model's Performance Evaluation

After completing our model, we need to evaluate its performance. We employ the coefficient of determination ( $R^2$ score) for that. It indicates how well our model fits the data. Let's recall it:

$R^2 = 1 - \frac{SS_{residuals}}{SS_{total}}$

Here, $SS_{residuals}$ denotes the residual sum of squares, and $SS_{total}$ is the total sum of squares:

$SS_{residuals} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$,

where $y_i$ are the observed values and $\hat{y}_i$ are the values predicted by the regression model.

$SS_{total} = \sum_{i=1}^{n} (y_i - \bar{y})^2$,

where $\bar{y}$ is the mean of the observed values.

The closer $R^2$ is to 1, the better the model fits the data.

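```python
# Predicted values for every data point: y_hat = X @ beta
predictions = X.dot(beta)
# Residual sum of squares: squared differences between observed and predicted values
ss_residuals = np.sum(np.square(y - predictions))
# Total sum of squares: squared differences between observed values and their mean
ss_total = np.sum(np.square(y - np.mean(y)))
r2_score = 1 - (ss_residuals/ss_total)

print("R^2 Score:", r2_score)  # Output (rounded): R^2 Score: 0.9992
```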

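The $R^2$ score is very close to 1, meaning the model fits this data almost perfectly. Keep in mind that with only five data points and four coefficients, such a close fit on the training data is expected, so it does not by itself show how well the model would predict new data.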
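To build intuition for the formula, here is a minimal sketch that wraps the same calculation in a helper and checks two boundary cases. The function name `r2_from_predictions` and the variable `y_obs` are hypothetical names chosen for this example.

```python
import numpy as np

def r2_from_predictions(y_true, y_pred):
    """Return R^2 = 1 - SS_residuals / SS_total for 1-D arrays of equal length."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_residuals = np.sum((y_true - y_pred) ** 2)
    ss_total = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_residuals / ss_total

# The lesson's target values.
y_obs = np.array([56, 81, 119, 22, 103], dtype=float)

# Perfect predictions leave no residuals, so R^2 is exactly 1.
r2_perfect = r2_from_predictions(y_obs, y_obs)
print(r2_perfect)  # 1.0

# Always predicting the mean makes SS_residuals equal SS_total, so R^2 is 0.
r2_mean = r2_from_predictions(y_obs, np.full_like(y_obs, y_obs.mean()))
print(r2_mean)  # 0.0
```

These two cases frame the score: a model is only useful to the extent that it does better than always predicting the mean.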

Lesson Summary and Practice

Congratulations on mastering Multiple Linear Regression! You've effectively bridged the gap from concept to implementation, designing a regression model in Python from scratch.

Prepare for the upcoming lesson to delve more deeply into Regression Analysis. Meanwhile, make sure to practice and refine your newly acquired skills!