Lesson 2
Training a Linear Regression Model
Lesson Introduction

In this lesson, we'll learn how to train a linear regression model. Linear regression helps predict values based on data. Imagine you have house areas and prices and want to predict the price of a house with an unknown area. That's where linear regression helps.

By the end of this lesson, you will understand linear regression, how to generate and handle synthetic data, and how to use Scikit-Learn to train a linear regression model. You'll also learn to interpret the model's output.

Understanding Linear Regression

Linear regression models the relationship between two variables by fitting a linear equation to the data. One variable is the explanatory variable, often called the "feature" and denoted by X; the other is the dependent variable, often called the "target" and denoted by y.

The linear regression formula is:

y = kX + b

Where:

  • b (the intercept) is the value of y when X is zero.
  • k (the coefficient, or slope) indicates how much y changes for each unit change in X.

Here is an example of some data and two lines trying to fit the data:

We aim to find the "best-fit" line, the one that minimizes the difference between the actual data points and the values the line predicts. This means finding the optimal line parameters k and b. It is typically done with the least squares method, which chooses the line that minimizes the sum of the squared differences between the observed values and the values predicted by the line.
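For the curious, the least-squares slope and intercept can be computed directly with NumPy's polyfit. This is a small illustrative sketch; the data points here are made up and are not the house data used later in the lesson:

```python
import numpy as np

# Toy data roughly following y = 2x + 1 (illustrative values only)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Least squares fit of a degree-1 polynomial: returns [slope, intercept]
k, b = np.polyfit(X, y, 1)
print(f"k = {k:.2f}, b = {b:.2f}")
# k = 1.95, b = 1.15
```

Scikit-Learn's LinearRegression, used below, finds the same least-squares solution; polyfit is just a convenient way to see the math in one call.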

Generating Synthetic Data

To train our model, we need data. We'll generate synthetic (fake) data for learning, like in the previous lesson.

Python
import numpy as np
import pandas as pd

np.random.seed(42)
num_samples = 100
area = np.random.uniform(500, 3500, num_samples)  # House area in square feet

# Assume a linear relationship: price = base_price + (area * price_per_sqft)
base_price = 50000.00
price_per_sqft = 200.00
noise = np.random.normal(0, 25000, num_samples)  # Adding noise
price = base_price + (area * price_per_sqft) + noise

# Output example data
print(f"Area: {area[:5].round(2)}")
# Area: [1623.62 3352.14 2695.98 2295.98  968.06]
print(f"Price: {price[:5].round(2)}")
# Price: [376900.25 712953.4  591490.38 459505.87 238119.39]

# Create DataFrame
df = pd.DataFrame({'Area': area.round(2), 'Price': price.round(2)})

We use NumPy to generate random house areas between 500 and 3500 square feet. The price is calculated based on a base price plus the area multiplied by a fixed rate per square foot, with some random noise.

Again, we use Pandas to organize our data into a DataFrame, a table-like structure for easier data manipulation.

Extracting Features and Target Variables

In machine learning, the inputs are called "features," and the output we want to predict is the "target" variable. In our case, the house area is the feature, and the house price is the target.

Python
# Extract features and target variable
X = df['Area'].values.reshape(-1, 1)  # Feature
y = df['Price'].values  # Target

# Output example features and target values
print(f"Features (X): {X[:5]}")
# Features (X): [[1623.62]
#  [3352.14]
#  [2695.98]
#  [2295.98]
#  [ 968.06]]
print(f"Target (y): {y[:5]}")
# Target (y): [376900.25 712953.4  591490.38 459505.87 238119.39]

We reshape the features so they're in the 2D format Scikit-Learn expects: reshape(-1, 1) converts X into a two-dimensional array with a single column (one row per sample).
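If reshape(-1, 1) is new to you, this tiny sketch (with made-up values) shows the shape change; the -1 tells NumPy to infer the number of rows automatically:

```python
import numpy as np

arr = np.array([10.0, 20.0, 30.0])  # shape (3,): a 1D array
col = arr.reshape(-1, 1)            # shape (3, 1): a 2D array with one column

print(arr.shape)  # (3,)
print(col.shape)  # (3, 1)
```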

Initializing and Training the Model

Now it's time to use the Scikit-Learn library to initialize and train our linear regression model.

Python
from sklearn.linear_model import LinearRegression

# Initialize the model
model = LinearRegression()

# Train the model
model.fit(X, y)

# Output fitted model information
print(f"Model has been trained: Intercept = {model.intercept_:.2f}, Coefficient = {model.coef_[0]:.2f}")
# Model has been trained: Intercept = 57293.24, Coefficient = 196.17

We first import the LinearRegression class from Scikit-Learn. Then we create an instance of the model and train it using the fit method, which takes in the features X and the target y.

Interpreting the Model Output

After training, we can check the model's coefficients to understand the relationship it found.

Python
# Print model coefficients
print(f"Intercept: {model.intercept_:.2f}, Coefficient: {model.coef_[0]:.2f}")
# Intercept: 57293.24, Coefficient: 196.17

The intercept is the point where the line crosses the y-axis (the predicted price when the area is zero), and the coefficient is the slope of the line (how much the price changes for each additional unit of area). In simpler terms, the intercept is the base price, and the coefficient is the rate per square foot.
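To make this concrete, a prediction can be reconstructed by hand from the fitted parameters printed above (a sketch using those printed values; the 2,000 sq ft house is a hypothetical example):

```python
# Manual prediction: price = intercept + coefficient * area
intercept = 57293.24  # fitted intercept from the trained model above
coef = 196.17         # fitted coefficient (dollars per extra square foot)

area = 2000  # hypothetical house of 2,000 sq ft
predicted_price = intercept + coef * area
print(f"Predicted price for {area} sq ft: ${predicted_price:,.2f}")
# Predicted price for 2000 sq ft: $449,633.24
```

This is exactly what the model's predict method computes for each input, as we'll see next.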

Visualizing the Model Fit

Let's plot the data and the fitted line to visualize how well our model fits the data.

Python
import matplotlib.pyplot as plt

# Plot the data points
plt.scatter(X, y / 1000, alpha=0.5, color='blue', label='Data points')

# Plot the regression line
plt.plot(X, model.predict(X) / 1000, color='red', label='Regression line')

plt.xlabel('Area (sq ft)')
plt.ylabel('Price (thousands of dollars)')
plt.title('Linear Regression: House Price vs. Area')
plt.legend()
plt.grid()
plt.show()

Using Matplotlib, we plot the original data points and the regression line. Note that we use the predict method to create the line. We will discuss this method in detail in the next lesson. It generally takes our features and predicts the target variable for each using the obtained best-fit line.

The blue dots represent the actual data points, and the red line represents our fitted linear regression model:
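As a quick sanity check that predict simply applies the fitted line, here is a small sketch on made-up data where y = 2X exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny illustrative fit (values are made up for this demo)
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
model = LinearRegression().fit(X, y)

# predict(X) should equal intercept + coefficient * X for every row
manual = model.intercept_ + model.coef_[0] * X.ravel()
print(np.allclose(model.predict(X), manual))  # True
```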

Multiple Features

So far, we've worked with a single feature, the house area. However, linear regression can also handle multiple features.

For example, suppose we have additional features such as the number of bedrooms and the age of the house. Our new dataset might look like this:

Plain text
   price    area   age  num_bedrooms
0  1000.00  500.00  3.00  3.00
1  2500.00  700.00  3.00  3.00
2  3000.00  700.00  2.00  2.00
3  4500.00  800.00  5.00  3.00

Here, X consists of multiple columns representing different features. The model now learns a coefficient for each feature, indicating how each one impacts the target variable. Though we can't visualize the four-dimensional data, the principle remains the same: linear regression finds the best-fit linear relationship (now a hyperplane rather than a line) for the given data!
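As a sketch of how this looks in code (using the tiny dataset above; not part of the lesson's main example), Scikit-Learn's fit accepts a 2D X with multiple columns and learns one coefficient per column:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The small multi-feature dataset shown above
df = pd.DataFrame({
    'price': [1000.00, 2500.00, 3000.00, 4500.00],
    'area': [500.00, 700.00, 700.00, 800.00],
    'age': [3.00, 3.00, 2.00, 5.00],
    'num_bedrooms': [3.00, 3.00, 2.00, 3.00],
})

X = df[['area', 'age', 'num_bedrooms']]  # several feature columns
y = df['price']

model = LinearRegression()
model.fit(X, y)

# One learned coefficient per feature
print(f"Intercept: {model.intercept_:.2f}")
for name, c in zip(X.columns, model.coef_):
    print(f"{name}: {c:.2f}")
```

The sign and size of each coefficient tell you how the model thinks that feature affects the price, holding the other features fixed.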

Lesson Summary

In this lesson, we've covered:

  1. Understanding linear regression and its purpose.
  2. Generating synthetic data to simulate real-world house areas and prices.
  3. Creating a DataFrame to organize our data.
  4. Extracting features and target variables for model training.
  5. Initializing and training a linear regression model using Scikit-Learn.
  6. Interpreting the model's output.

Now that you understand the theory and steps involved in training a linear regression model, it's time to put this knowledge into practice. In the practice session, you will get hands-on experience with training a model and making predictions based on new data. Get ready to apply what you've learned and build your own linear regression model!
