Lesson 5

Welcome to our lesson on **Linear Regression Analysis**! This technique is fundamental in *machine learning* for predicting values based on data. By the end of this lesson, you will understand what *linear regression* is, why it's useful, and how to create it using Python with the popular `scikit-learn` library.

Imagine you're running a lemonade stand and want to predict future sales based on past data. Linear regression helps you figure out the trend and make educated guesses. Let's explore how it works.

Linear regression models the relationship between two variables by fitting a straight line to the observed data. The simplest form is **simple linear regression**, where we have one independent variable (input) and one dependent variable (output).

Let's say you have the following data on hours studied and the corresponding test scores:

- Hours studied: [1, 2, 3, 4, 5]
- Test scores: [2, 4, 5, 4, 5]

Our goal is to predict the test score for studying 6 hours. We'll start by visualizing the data:

The scatter plot shows the relationship between hours studied and test scores. Now, let's introduce a line to approximate this relationship.
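As a minimal sketch (assuming `matplotlib` is installed; the labels and title are illustrative choices), the scatter plot could be produced like this:

```python
import matplotlib.pyplot as plt

# Hours studied and the corresponding test scores from the lesson
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

# Draw each (hours, score) pair as a point
plt.scatter(hours, scores)
plt.xlabel('Hours studied')
plt.ylabel('Test score')
plt.title('Hours Studied vs. Test Scores')
plt.show()
```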

The general formula for a line is: $y = mx + c$ where:

- $y$ is the dependent variable (output we predict, like sales).
- $x$ is the independent variable (input, like days).
- $m$ (slope) determines the line's steepness.
- $c$ (intercept) is where the line crosses the y-axis.

This line helps us understand the trend in data and predict future values.

Different lines can be drawn through the same set of points, but only one will fit the data best. For simplicity, let's draw a few lines and see how they compare visually.

This will plot several lines on the data, helping visualize different possible models. The goal is to find the line that best fits the data points. A better-fitting line will have data points that are closer to it, indicating smaller errors or distances between the observed values and the predicted values. It is easy to see visually that the blue line misses the data, while the orange line fits it better. But how do we compare the orange line to the red line? They both seem quite good.
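A sketch of such a comparison might look like the following, assuming `matplotlib` is installed. The `(slope, intercept)` pairs below are illustrative guesses chosen by eye, not computed values:

```python
import numpy as np
import matplotlib.pyplot as plt

hours = np.array([1, 2, 3, 4, 5])
scores = np.array([2, 4, 5, 4, 5])

# Illustrative (slope, intercept) candidates, chosen by eye
candidates = [(1.2, 0.0), (0.5, 2.5), (0.7, 2.0)]

plt.scatter(hours, scores)
for m, c in candidates:
    # Plot the line y = mx + c over the same x range as the data
    plt.plot(hours, m * hours + c, label=f'y = {m}x + {c}')
plt.xlabel('Hours studied')
plt.ylabel('Test score')
plt.legend()
plt.show()
```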

The best-fit line minimizes the error between the observed data points and the predicted values. We can use `scikit-learn`, a powerful machine learning library in Python, to find this line easily.

Let's calculate this in Python:

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Data for hours studied and test scores
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)  # Reshaped for sklearn
y = np.array([2, 4, 5, 4, 5])

# Create a Linear Regression model
model = LinearRegression().fit(X, y)

# Calculate the slope (m) and intercept (c)
m = model.coef_[0]
c = model.intercept_

print(f'Calculated slope (m): {m}')  # Calculated slope (m): 0.6
print(f'Calculated intercept (c): {c}')  # Calculated intercept (c): 2.2
```

This code provides the slope and intercept for the best-fit line through the data points using `scikit-learn`. You will learn more about `scikit-learn` when you start exploring Machine Learning. For now, let's quickly review this code:

- `LinearRegression()` creates a linear regression model, which is capable of learning from the data and making predictions.
- `.fit(X, y)` trains the model, finding the best-fit line's coefficients.
- `.coef_[0]` obtains the slope of the best-fit line. We need `[0]` here because a model can have multiple input variables, so `.coef_` is an array of coefficients. In our single-input case, the array holds one coefficient, which we access with `[0]`.
- `.intercept_` obtains the intercept of the best-fit line.

Now that we have the slope and intercept, let's plot the best-fit line on the original data:

This is the best-fit line: it minimizes the overall distance between the line and the data points.
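A minimal sketch of that plot, refitting the model from the earlier snippet (it assumes `matplotlib` is installed; the red color and labels are illustrative choices):

```python
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 5, 4, 5])

model = LinearRegression().fit(X, y)
m, c = model.coef_[0], model.intercept_

# Draw the data points, then overlay the fitted line y = mx + c
plt.scatter(X, y, label='Observed data')
plt.plot(X, model.predict(X), color='red', label=f'y = {m:.1f}x + {c:.1f}')
plt.xlabel('Hours studied')
plt.ylabel('Test score')
plt.legend()
plt.show()
```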

With the best-fit line equation $y = 0.6x + 2.2$, we can predict new values. Let's predict the test score for 6 hours of study:

$y(6) = 0.6 \cdot 6 + 2.2 = 3.6 + 2.2 = 5.8$

This equation doesn't take into account that scores are limited to the $[2, 5]$ interval, so the prediction of $5.8$ should be treated as $5$.
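The same prediction can be obtained directly from the trained model with its `predict` method, rather than plugging numbers into the equation by hand:

```python
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 5, 4, 5])
model = LinearRegression().fit(X, y)

# Predict the test score for 6 hours of study
predicted_score = model.predict(np.array([[6]]))[0]
print(f'Predicted score for 6 hours: {predicted_score:.1f}')  # 5.8
```

Note that `predict` expects a 2D array, just like `fit`, which is why the input is written as `[[6]]`.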

Let's make a prediction for 0 hours of study:

$y(0) = 0.6 \cdot 0 + 2.2 = 2.2$

The best-fit line predicts a score of about $2$ for zero hours of study.

Fantastic! You've learned the basics of *linear regression* and how to calculate it in Python using `scikit-learn`. We've explored predicting a test score based on hours studied by calculating and plotting the best-fit line.

Now it's time to put this knowledge into practice. In the next session, you'll implement linear regression on a new dataset and make predictions. Let's dive into those exercises and solidify your understanding!