Lesson 2
Linear Regression: From Basics to Predictive Modeling
Introduction

Welcome to this lesson on regression analysis. Before we continue working with the California Housing Dataset, we'll take a brief detour to explain regression with a simpler dataset. This will help us understand the principles of linear regression, build a linear regression model in Python, compute its coefficients, and predict values in a more controlled and comprehensible setting. Are you ready to decode regression analysis?

Creating a Simple Dataset

Before implementing our regression model, let's create a simple dataset to use in our computations. Consider a scenario where x holds the feature values (the independent variable) and y holds the corresponding target values (the dependent variable). Our aim is to predict y from x by finding the line that best fits the data.

Let's delve into Python code to shape our dataset:

Python
# We will first define our hypothetical dataset
x = [1, 2, 3, 4, 5]  # The feature values (independent variable)
y = [1, 2, 3, 2, 3]  # The target values (dependent variable)

Understanding Regression

At the heart of statistics lies Regression Analysis, a powerful tool that draws connections among variables. Imagine sketching a line or crafting a curve that best fits the distribution of data on a two-dimensional plane. Pretty cool, right?

With independent variables affecting dependent variables, regression analysis makes these connections visible. Linear Regression is a fundamental method for predicting the value of a dependent variable (y) from the value of an independent variable (x). It is represented by the following formula:

$$y = \beta x + \alpha$$

Where:

  • $y$ is the dependent variable we aim to predict.
  • $\beta$ represents the slope of the regression line, indicating how much $y$ changes with a unit change in $x$.
  • $\alpha$ is the y-intercept of the regression line.
  • $x$ is the independent variable.

This equation is the cornerstone of linear regression, providing a straightforward approach to predicting the dependent variable based on the independent variable.
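To make the roles of the slope and intercept concrete, here is a minimal sketch that evaluates the line equation. The coefficient values and the helper name `line` are hypothetical, chosen only for illustration, and are not fitted to any data.

```python
# Hypothetical coefficients, chosen only for illustration (not fitted to data)
beta = 2.0   # slope: y changes by 2 for each unit change in x
alpha = 1.0  # intercept: the value of y when x is 0


def line(x_value):
    """Evaluate y = beta * x + alpha for a single x value."""
    return beta * x_value + alpha


print(line(0))            # the intercept: 1.0
print(line(4) - line(3))  # change in y for a unit change in x: 2.0
```

Evaluating the line at x = 0 returns the intercept, and the difference between two x values one unit apart returns the slope.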

Calculation of Regression Line Coefficients

In this section, we'll dive into computing the coefficients alpha (α) and beta (β) of our regression line, which are crucial for creating our predictive model. To accomplish this, let's first understand the formulas used to determine α and β.

The slope of the line (β) can be calculated using the formula:

$$\beta = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$

Where:

  • $x_i$ and $y_i$ are the individual sample points for the independent and dependent variables, respectively.
  • $\bar{x}$ and $\bar{y}$ are the mean values of the independent variable (x) and dependent variable (y), respectively.

Once we have β, we can find the intercept (α) using the formula:

$$\alpha = \bar{y} - \beta\bar{x}$$

Now let's transition to the Python code necessary to compute these coefficients.

Python
# Compute the mean (average) of x and y
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Calculate the numerator and the denominator for the slope (β)
numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
denominator = sum((xi - mean_x) ** 2 for xi in x)

# Compute the coefficients
beta = numerator / denominator
alpha = mean_y - (beta * mean_x)

# Print the coefficients (rounded, because floating-point arithmetic can leave tiny errors)
print('Beta:', round(beta, 4), 'Alpha:', round(alpha, 4))  # Output: Beta: 0.4 Alpha: 1.0

First, the numerator is calculated by multiplying the deviations of each x and y value from their means and summing these products. This sum is proportional to the covariance of x and y and shows how y varies with x. Next, the denominator is the sum of the squared deviations of the x values from their mean, which is proportional to the variance of x and reflects its spread. Dividing the numerator by the denominator gives β, the slope of the regression line, which indicates the average change in y per unit change in x. Finally, α is computed by subtracting the product of β and the mean of x from the mean of y, which positions the regression line by setting its y-intercept.
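For readers who want to follow the arithmetic by hand, the same steps can be written out for the five data points defined earlier, x = [1, 2, 3, 4, 5] and y = [1, 2, 3, 2, 3]:

```latex
\begin{align*}
\bar{x} &= \frac{1 + 2 + 3 + 4 + 5}{5} = 3, \qquad \bar{y} = \frac{1 + 2 + 3 + 2 + 3}{5} = 2.2 \\
\sum (x_i - \bar{x})(y_i - \bar{y}) &= (-2)(-1.2) + (-1)(-0.2) + (0)(0.8) + (1)(-0.2) + (2)(0.8) = 4 \\
\sum (x_i - \bar{x})^2 &= 4 + 1 + 0 + 1 + 4 = 10 \\
\beta &= \frac{4}{10} = 0.4 \\
\alpha &= 2.2 - 0.4 \times 3 = 1
\end{align*}
```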

Through this process, we've calculated the coefficients that define the relationship between our variables in the linear regression model. This gives us the formula for our regression line: $y = 0.4x + 1$
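The formulas for β and α are the ordinary least squares solution, meaning they give the line with the smallest sum of squared vertical distances to the data points. The sketch below illustrates this by comparing the fitted line with two nearby lines. The helper name `sum_squared_error` and the two alternative coefficient pairs are assumptions made for this illustration.

```python
# Dataset from the lesson
x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 2, 3]


def sum_squared_error(alpha, beta):
    """Sum of squared differences between actual y and the line beta * x + alpha."""
    return sum((yi - (beta * xi + alpha)) ** 2 for xi, yi in zip(x, y))


fitted = sum_squared_error(1.0, 0.4)   # the line computed in this lesson
steeper = sum_squared_error(1.0, 0.5)  # hypothetical line with a larger slope
higher = sum_squared_error(1.2, 0.4)   # hypothetical line with a larger intercept

print(round(fitted, 4), round(steeper, 4), round(higher, 4))  # 1.2 1.75 1.4
```

Both alternative lines produce a larger error than the fitted line, which is what "best fit" means here.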

Implementing the Regression Model

Armed with alpha and beta, we can now code a function to calculate our regression line.

Python
# Function to make predictions
def predict_y(alpha, beta, x_i):
    return beta * x_i + alpha

Making Predictions

It's time to put our regression model to work and conjure up some predictions!

Python
# Making predictions
y_pred = [predict_y(alpha, beta, x_i) for x_i in x]

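To see what these predictions look like next to the actual values, here is a small self-contained sketch. It redefines the dataset and the function so that it runs on its own, uses the coefficient values 0.4 and 1.0 found earlier, and introduces a `residuals` list (actual minus predicted), a name that is not part of the original lesson code.

```python
# Dataset and coefficients from the lesson, repeated so this block runs on its own
x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 2, 3]
alpha, beta = 1.0, 0.4


def predict_y(alpha, beta, x_i):
    """Predict y for a single x value using the regression line."""
    return beta * x_i + alpha


y_pred = [predict_y(alpha, beta, x_i) for x_i in x]

# Residual = actual value minus predicted value (illustrative addition)
residuals = [y_i - y_hat for y_i, y_hat in zip(y, y_pred)]

for x_i, y_i, y_hat, r in zip(x, y, y_pred, residuals):
    print(f"x={x_i}  actual={y_i}  predicted={y_hat:.1f}  residual={r:+.1f}")
```

The residuals of a least squares line sum to zero up to floating-point rounding, which is a quick sanity check on the coefficients.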
Visualizing the Data

Let's plot our actual data points together with the regression line. We'll also predict a single new point from an input value and plot it alongside them.

Python
import matplotlib.pyplot as plt

# Visualizing the data
plt.scatter(x, y, color='blue')  # Actual data points
plt.plot(x, y_pred, color='red')  # Predicted regression line

# Predicting and plotting a single data point
x_new = 3.5  # Example new data point
y_new_pred = predict_y(alpha, beta, x_new)  # Prediction for the new data point
plt.scatter(x_new, y_new_pred, color='green', s=100, zorder=5)  # Plotting the new predicted point

plt.xlabel('X')
plt.ylabel('Y')
plt.grid(True)
plt.show()

In this example, we have visualized the actual data points against the predicted regression line, and we have also predicted and plotted a single new data point from an input (x_new). The green dot marks this new predicted value, so it is easy to tell apart from the existing data. This shows how new predictions can be made and visualized in the context of the original dataset. It is the same as plugging the new value of 3.5 into our formula: $0.4 \times 3.5 + 1 = 2.4$

Lesson Summary and Practice

Congratulations on successfully deciphering regression analysis! We've built a simple dataset, computed the regression coefficients, implemented a linear regression model, and visualized its predictions. Now, let's reinforce your learning with practice exercises. Go ahead and explore the fascinating world of regression analysis!
