Lesson 2

Implementing Multiple Linear Regression from Scratch


Welcome to our exciting second class in the Regression and Gradient Descent series! In the previous lesson, we covered Simple Linear Regression. Now, we're transitioning toward Multiple Linear Regression, a powerful tool for examining the relationship between a dependent variable and several independent variables.

Consider a case where we need to predict house prices, which undoubtedly depend on multiple factors, such as location, size, and the number of rooms. Multiple Linear Regression accounts for these simultaneous predictors. In today's lesson, you'll learn how to implement this concept in Python!

Multiple Linear Regression - The Concept

Multiple Linear Regression builds upon the concept of Simple Linear Regression, accounting for more than one independent variable.

Let's recall the Simple Linear Regression equation:

$$y = \beta_0 + \beta_1 x$$

For Multiple Linear Regression, we add multiple independent variables, $x_1, x_2, \ldots, x_m$, each with its own coefficient:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_m x_m$$

Linear Algebra Behind: Dataset Representation

Suppose we have $n$ data points (equations), each with $m$ features ($x$ values). Then $\mathbf{X}$ looks like:

$$\mathbf{X} = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \ldots & x_{1,m} \\ 1 & x_{2,1} & x_{2,2} & \ldots & x_{2,m} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & x_{n,2} & \ldots & x_{n,m} \end{bmatrix}$$

Each row holds the $m$ features of a single data point. Notice the leading column of 1's: it represents the intercept (also called the bias) of each equation.

For each row (equation), there is a corresponding y value. So y looks like:

$$\mathbf{y} = \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix}$$

The normal equation results in a vector of $m + 1$ coefficients, one for the intercept and one per feature:

$$\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{m} \end{bmatrix}$$

Linear Algebra Behind: Making a Prediction

Now, for any set of features $x_1$ through $x_m$, we can predict the $\hat{y}$ value as:

$$\hat{y} = (1 \cdot \beta_0) + (\beta_1 \cdot x_1) + (\beta_2 \cdot x_2) + \ldots + (\beta_m \cdot x_m)$$
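As a minimal sketch of this formula for a single data point, the snippet below uses made-up coefficients and feature values (both are assumptions chosen for easy arithmetic, not values fitted to any dataset):

```python
import numpy as np

# Hypothetical coefficients, for illustration only: [beta_0, beta_1, beta_2, beta_3]
beta = np.array([10.0, 2.0, -1.0, 0.5])
# One hypothetical data point with m = 3 features
x = np.array([3.0, 4.0, 8.0])

# Term by term: intercept plus each coefficient times its feature
y_hat = beta[0] + beta[1] * x[0] + beta[2] * x[1] + beta[3] * x[2]
print(y_hat)  # 10 + 6 - 4 + 4 = 16.0

# The same value as a dot product, after prepending a 1 for the intercept
y_hat_dot = np.dot(np.concatenate(([1.0], x)), beta)
print(y_hat_dot)  # 16.0
```

Prepending the 1 is what lets the intercept be treated like any other coefficient.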

To calculate all the predictions at once, we multiply the matrix $X$ by the coefficient vector $\beta$:

$$\hat{\mathbf{y}} = \begin{bmatrix} \hat{y}_{1} \\ \hat{y}_{2} \\ \vdots \\ \hat{y}_{n} \end{bmatrix} = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \ldots & x_{1,m} \\ 1 & x_{2,1} & x_{2,2} & \ldots & x_{2,m} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & x_{n,2} & \ldots & x_{n,m} \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{m} \end{bmatrix} = X \beta$$
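The matrix form can be checked against the row-by-row formula with a small sketch. The matrix and coefficients here are hypothetical toy values ($n = 3$ data points, $m = 2$ features), picked only to keep the arithmetic readable:

```python
import numpy as np

# Hypothetical design matrix with the column of ones already in place
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 0.0, 1.0],
              [1.0, 4.0, 5.0]])
# Hypothetical coefficients: [beta_0, beta_1, beta_2]
beta = np.array([1.0, 2.0, 3.0])

# All predictions at once: one matrix-vector product
y_hat = X @ beta

# The same predictions computed one row (one equation) at a time
row_by_row = np.array([np.dot(row, beta) for row in X])

print(y_hat)       # predictions are 14, 4 and 24
print(row_by_row)  # identical values
```

Each entry of the result is the dot product of one row of $X$ with $\beta$, which is exactly the single-point prediction formula.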
Linear Algebra Behind: Math Solution

To implement Multiple Linear Regression, we'll leverage some Linear Algebra concepts. Using the Normal Equation, we can calculate the coefficients for our regression equation:

$$\beta = (X^T X)^{-1} X^T y$$

Here $X$ is the feature matrix (including the column of ones) and $y$ is the vector of target values. As in Simple Linear Regression, residuals (the differences between actual and predicted values) play a significant role: the smaller these residuals, the better the model fits.
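For readers wondering where the Normal Equation comes from, here is a brief sketch of the standard derivation. It assumes $X^T X$ is invertible and picks the $\beta$ that minimizes the sum of squared residuals $S(\beta)$ (a symbol introduced here only for this sketch):

$$
\begin{aligned}
S(\beta) &= \lVert y - X\beta \rVert^2 = (y - X\beta)^T (y - X\beta) \\
\nabla_\beta S(\beta) &= -2 X^T (y - X\beta) = 0 \\
X^T X \beta &= X^T y \\
\beta &= (X^T X)^{-1} X^T y
\end{aligned}
$$

Setting the gradient to zero gives the linear system in the third line, and multiplying both sides by $(X^T X)^{-1}$ yields the formula above.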

Implementing Multiple Linear Regression from Scratch

Let's roll up our sleeves and start coding! We'll primarily rely on NumPy to handle numerical operations and matrices.

First, we set up our dataset:

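```python
import numpy as np

# Feature matrix: 5 data points, 3 features each
X = np.array([[73, 67, 43],
              [91, 88, 64],
              [87, 134, 58],
              [102, 43, 37],
              [69, 96, 70]], dtype='float32')

# Target values, one per data point
y = np.array([56, 81, 119, 22, 103], dtype='float32')
```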

Next, we calculate our vector of coefficients, $\beta$, using the Normal Equation:

  1. Augment our feature matrix, $X$, with an extra column of ones to account for the intercept.

```python
ones = np.ones(shape=(len(X), 1))
X = np.append(ones, X, axis=1)  # the ones become the first column
```

  2. Compute the coefficients $\beta$ using the Normal Equation.

```python
beta = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
```

We could also use the @ operator instead of .dot. Choose whichever you find more readable:

```python
beta = np.linalg.inv(X.T @ X) @ X.T @ y
```
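As an aside, explicitly inverting $X^T X$ is not the only way to get the coefficients. The sketch below rebuilds the same dataset (in float64, and under the new name `X_design` for the matrix with the ones column; both are choices made for this example) and compares the explicit inverse with two NumPy routines, `np.linalg.solve` and `np.linalg.lstsq`, which are generally the more numerically robust options:

```python
import numpy as np

# Same data as in the lesson, stored as float64 for this comparison
X = np.array([[73, 67, 43],
              [91, 88, 64],
              [87, 134, 58],
              [102, 43, 37],
              [69, 96, 70]], dtype='float64')
y = np.array([56, 81, 119, 22, 103], dtype='float64')

# Design matrix: a column of ones followed by the features
X_design = np.column_stack([np.ones(len(X)), X])

# 1) Normal Equation with an explicit inverse, as in the lesson
beta_inv = np.linalg.inv(X_design.T @ X_design) @ X_design.T @ y

# 2) Solve the linear system (X^T X) beta = X^T y without forming the inverse
beta_solve = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

# 3) Least-squares solver working directly on X_design and y
beta_lstsq, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(beta_inv)
print(beta_solve)
print(beta_lstsq)
```

On a small, well-behaved dataset like this one, all three approaches should agree up to floating-point rounding.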
Model's Performance Evaluation

After completing our model, we need to evaluate its performance. We use the coefficient of determination (the $R^2$ score) for that. It indicates how well our model fits the data. Let's recall it:

$$R^2 = 1 - \frac{SS_{residuals}}{SS_{total}}$$

Here, $SS_{residuals}$ denotes the residual sum of squares, and $SS_{total}$ is the total sum of squares:

$$SS_{residuals} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $y_i$ are the observed values and $\hat{y}_i$ are the values predicted by the regression model.

$$SS_{total} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

where $y_i$ are the observed values and $\bar{y}$ is the mean of the observed values.

The closer the $R^2$ value is to 1, the better the model fits the data.

```python
predictions = X.dot(beta)
ss_residuals = np.sum(np.square(y - predictions))
ss_total = np.sum(np.square(y - np.mean(y)))
r2_score = 1 - (ss_residuals/ss_total)

print("R^2 Score:", r2_score) # Output: R^2 Score: 0.9992
```
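The same calculation can be wrapped in a small reusable helper. The function name `r_squared` and the toy arrays below are illustrative choices, not part of the lesson's code. The two checks show the extremes of the score: perfect predictions give 1, and always predicting the mean gives 0.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residuals / SS_total."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_residuals = np.sum((y_true - y_pred) ** 2)
    ss_total = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_residuals / ss_total

# Hypothetical observed values, for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Predictions that match the observations exactly: no residuals
perfect = r_squared(y_true, y_true)

# Predicting the mean for every point: residuals equal the total variation
baseline = r_squared(y_true, np.full(4, y_true.mean()))

print(perfect)   # 1.0
print(baseline)  # 0.0
```

Note that this sketch does not guard against a constant `y_true`, where $SS_{total}$ would be zero.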

The $R^2$ score is very close to one, meaning the model fits this training data almost perfectly!

Lesson Summary and Practice

Congratulations on mastering Multiple Linear Regression! You've effectively bridged the gap from concept to implementation, designing a regression model in Python from scratch.

Prepare for the upcoming lesson to delve more deeply into Regression Analysis. Meanwhile, make sure to practice and refine your newly acquired skills!
