Linear Regression Analysis

Lesson 5

Lesson Introduction

Welcome to our lesson on Linear Regression Analysis! This technique is fundamental in machine learning for predicting values based on data. By the end of this lesson, you will understand what linear regression is, why it's useful, and how to fit a linear regression model in Python with the popular scikit-learn library.

Imagine you're running a lemonade stand and want to predict future sales based on past data. Linear regression helps you figure out the trend and make educated guesses. Let's explore how it works.

Understanding Linear Regression

Linear regression models the relationship between two variables by fitting a straight line to the observed data. The simplest form is simple linear regression, where we have one independent variable (input) and one dependent variable (output).

Real-Life Example

Let's say you have the following data on hours studied and the corresponding test scores:

Hours studied: [1, 2, 3, 4, 5]
Test scores: [2, 4, 5, 4, 5]

Our goal is to predict the test score for studying 6 hours. We'll start by visualizing the data:

The scatter plot shows the relationship between hours studied and test scores. Now, let's introduce a line to approximate this relationship.

Plotting Multiple Lines

The general formula for a line is: $y = mx + c$ where:

$y$ is the dependent variable (the output we predict, like test scores).
$x$ is the independent variable (the input, like hours studied).
$m$ (slope) determines the line's steepness.
$c$ (intercept) is where the line crosses the y-axis.

This line helps us understand the trend in data and predict future values.
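To make the formula concrete, here is a minimal sketch that evaluates $y = mx + c$ for the hours in our example. The function name `line` and the values of `m_guess` and `c_guess` are illustrative assumptions, not values from the lesson.

```python
import numpy as np

def line(x, m, c):
    """Return y = m*x + c for a number or a NumPy array x."""
    return m * x + c

hours = np.array([1, 2, 3, 4, 5])

# Hypothetical slope and intercept, chosen only for illustration
m_guess, c_guess = 1.0, 1.0

predicted = line(hours, m_guess, c_guess)
print(predicted)  # [2. 3. 4. 5. 6.]
```

Changing `m_guess` tilts the line, and changing `c_guess` shifts it up or down.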

Different lines can be drawn through the same set of points, but only one will fit the data best. For simplicity, let's draw a few lines and see how they compare visually.

Plotting several lines over the data helps visualize different possible models. The goal is to find the line that best fits the data points. A better-fitting line lies closer to the data points, which means smaller errors (distances) between the observed values and the predicted values. It is easy to see that the blue line is off the data and that the orange line fits it better. But how do we compare the orange line to the red line? They both seem quite good.
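One common way to compare lines numerically is to add up the squared vertical distances between each data point and the line, often called the sum of squared errors (SSE). The sketch below assumes that measure. The three candidate lines are hypothetical and are not the blue, orange, and red lines from the plot.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5])
scores = np.array([2, 4, 5, 4, 5])

def sum_squared_errors(m, c):
    """Sum of squared vertical distances between the data and y = m*x + c."""
    predicted = m * hours + c
    return float(np.sum((scores - predicted) ** 2))

# Hypothetical candidate lines as (slope, intercept) pairs
candidates = {
    "A": (0.0, 1.0),
    "B": (1.0, 1.0),
    "C": (0.5, 2.5),
}

for name, (m, c) in candidates.items():
    print(f"Line {name}: y = {m}x + {c}, SSE = {sum_squared_errors(m, c):.2f}")

# Line A: y = 0.0x + 1.0, SSE = 51.00
# Line B: y = 1.0x + 1.0, SSE = 4.00
# Line C: y = 0.5x + 2.5, SSE = 2.50
```

The smaller the SSE, the closer the line sits to the data, so two lines that look equally good by eye can still be ranked.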

Using scikit-learn to Calculate the Best-Fit Line

The best-fit line minimizes the error between the observed data points and the predicted values, measured as the sum of squared differences between them. We can use scikit-learn, a powerful machine learning library in Python, to find this line easily.
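For a single input variable, this minimization has a well-known closed-form solution. The worked sketch below assumes the error is the sum of squared differences (ordinary least squares, which is what scikit-learn's `LinearRegression` fits). The symbols $\bar{x}$ and $\bar{y}$, the means of the hours and the scores, are notation introduced here rather than taken from the lesson.

```latex
\begin{aligned}
m &= \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad c = \bar{y} - m\,\bar{x} \\
\bar{x} &= 3, \qquad \bar{y} = 4 \\
\sum_i (x_i - \bar{x})(y_i - \bar{y}) &= (-2)(-2) + (-1)(0) + (0)(1) + (1)(0) + (2)(1) = 6 \\
\sum_i (x_i - \bar{x})^2 &= 4 + 1 + 0 + 1 + 4 = 10 \\
m &= \frac{6}{10} = 0.6, \qquad c = 4 - 0.6 \cdot 3 = 2.2
\end{aligned}
```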

Let's calculate this in Python:

Python
from sklearn.linear_model import LinearRegression
import numpy as np

# Data for hours studied and test scores
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)  # Reshaped to a column: sklearn expects 2-D input
y = np.array([2, 4, 5, 4, 5])

# Create a linear regression model and fit it to the data
model = LinearRegression().fit(X, y)

# Read the slope (m) and intercept (c) from the fitted model
m = model.coef_[0]
c = model.intercept_

# Round to one decimal place so floating-point noise doesn't clutter the output
print(f'Calculated slope (m): {m:.1f}')  # Calculated slope (m): 0.6
print(f'Calculated intercept (c): {c:.1f}')  # Calculated intercept (c): 2.2

This code computes the slope and intercept of the best-fit line through the data points using scikit-learn. You will learn more about scikit-learn when you start exploring machine learning. For now, let's quickly review this code:

LinearRegression() creates a linear regression model, which is capable of learning from the data and making predictions.
The .fit(X, y) method trains the model, finding the coefficients of the best-fit line.
.coef_[0] obtains the slope of the best-fit line. We need [0] because a model can have several input variables (features), so the model's .coef_ is an array with one coefficient per feature. With a single feature, as in our case, the array holds one coefficient, and [0] retrieves it.
.intercept_ obtains the intercept of the best-fit line.
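The shapes involved can be checked directly. This minimal sketch refits the same data and prints the shape of the input and of `.coef_`. The variable name `hours` is introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.array([1, 2, 3, 4, 5])
X = hours.reshape(-1, 1)  # 5 rows (samples), 1 column (feature)
y = np.array([2, 4, 5, 4, 5])

model = LinearRegression().fit(X, y)

print(X.shape)            # (5, 1)
print(model.coef_.shape)  # (1,)
```

Because there is one feature column in `X`, `.coef_` holds exactly one slope, which is why `[0]` is enough to read it.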

Plot the Line with the Data

Now that we have the slope and intercept, let's plot the best-fit line on the original data:

This is the best-fit line: it minimizes the sum of squared vertical distances between the line and the data points.
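One way to draw such a plot is sketched below, assuming matplotlib is installed. The colors and labels are illustrative choices and not necessarily those of the lesson's figure. The slope and intercept are the values calculated earlier in the lesson.

```python
import matplotlib.pyplot as plt
import numpy as np

hours = np.array([1, 2, 3, 4, 5])
scores = np.array([2, 4, 5, 4, 5])
m, c = 0.6, 2.2  # slope and intercept calculated earlier in the lesson

fig, ax = plt.subplots()
ax.scatter(hours, scores, label="Observed scores")                 # original data
ax.plot(hours, m * hours + c, color="red", label="Best-fit line")  # fitted line
ax.set_xlabel("Hours studied")
ax.set_ylabel("Test score")
ax.legend()
```

Call `plt.show()` to display the figure, or `fig.savefig(...)` to write it to a file.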

Making Predictions for New Values

With the best-fit line equation $y = 0.6x + 2.2$ , we can predict new values. Let's predict the test score for 6 hours of study:

$$y(6) = 0.6 \cdot 6 + 2.2 = 3.6 + 2.2 = 5.8$$

This equation doesn't take into account that scores are limited to the interval $[2, 5]$, so the predicted score of $5.8$ should be treated as $5$.
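The same prediction can be made with the fitted model's `predict` method and then limited to the score range. This is a minimal sketch: using `np.clip` for the cap is an assumption about how to apply the limit, and the names `raw` and `clipped` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 5, 4, 5])
model = LinearRegression().fit(X, y)

# predict expects a 2-D array: one row per sample, one column per feature
raw = model.predict(np.array([[6]]))[0]

# Clamp the prediction to the score range stated in the lesson, [2, 5]
clipped = float(np.clip(raw, 2, 5))

print(f"Raw prediction: {raw:.1f}")          # Raw prediction: 5.8
print(f"Clipped prediction: {clipped:.1f}")  # Clipped prediction: 5.0
```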

Let's make a prediction for 0 hours of study:

$$y(0) = 0.6 \cdot 0 + 2.2 = 2.2$$

The best-fit line predicts a score of $2.2$ (about $2$) for zero hours of study. This is exactly the intercept $c$.

Lesson Summary

Fantastic! You've learned the basics of linear regression and how to fit a best-fit line in Python using scikit-learn. We've explored predicting a test score based on hours studied by calculating and plotting the best-fit line.

Now it's time to put this knowledge into practice. In the next session, you'll implement linear regression on a new dataset and make predictions. Let's dive into those exercises and solidify your understanding!
