In this lesson, we'll learn how to train a linear regression model. Linear regression helps predict values based on data. Imagine you have house areas and prices and want to predict the price of a house with an unknown area. That's where linear regression helps.
By the end of this lesson, you will understand linear regression, how to generate and handle synthetic data, and how to use Scikit-Learn to train a linear regression model. You'll also learn to interpret the model's output.
Linear regression models the relationship between two variables by fitting a linear equation to the data. One variable is the explanatory variable, often called the "feature" and denoted by X, and the other is the dependent variable, often called the "target" and denoted by y.
The linear regression formula is:

y = kX + b

Where:
- b (the intercept) is the y value when X is zero.
- k (the coefficient) indicates how much y changes for each unit change in X.
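To make this concrete, here is a quick worked example in Python with made-up values for k and b (chosen purely for illustration):

```python
# Illustrative (made-up) parameters: base price b and price-per-square-foot k
b = 50000   # intercept: the predicted y when X is 0
k = 200     # coefficient: how much y changes per one-unit change in X

X = 1500            # house area in square feet
y = k * X + b       # apply the formula y = kX + b
print(y)            # 350000
```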
Here is an example of some data and two lines trying to fit the data:
We aim to find the "best-fit" line, the one that minimizes the difference between the actual data points and the predicted values. In other words, we need to find the optimal line parameters k and b. This is typically achieved using the least squares method, which finds the line that minimizes the sum of the squared differences between the observed values and the values predicted by the line.
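For intuition only, here is a minimal sketch of the least squares method computed by hand with NumPy on a tiny made-up dataset; Scikit-Learn will do this work for us later in the lesson:

```python
import numpy as np

# Tiny made-up dataset for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

# Closed-form least squares estimates for the line y = kx + b
k = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - k * x.mean()

print(f"k = {k:.2f}, b = {b:.2f}")  # k = 1.96, b = 0.24
```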
To train our model, we need data. We'll generate synthetic (fake) data for learning, like in the previous lesson.
```python
import numpy as np
import pandas as pd

np.random.seed(42)
num_samples = 100
area = np.random.uniform(500, 3500, num_samples)  # House area in square feet

# Assume a linear relationship: price = base_price + (area * price_per_sqft)
base_price = 50000.00
price_per_sqft = 200.00
noise = np.random.normal(0, 25000, num_samples)  # Adding noise
price = base_price + (area * price_per_sqft) + noise

# Output example data
print(f"Area: {area[:5].round(2)}")
# Area: [1623.62 3352.14 2695.98 2295.98 968.06]
print(f"Price: {price[:5].round(2)}")
# Price: [376900.25 712953.4 591490.38 459505.87 238119.39]

# Create DataFrame
df = pd.DataFrame({'Area': area.round(2), 'Price': price.round(2)})
```
We use NumPy to generate random house areas between 500 and 3500 square feet. The price is calculated as a base price plus the area multiplied by a fixed rate per square foot, with some random noise added. Again, we use Pandas to organize our data into a DataFrame, a table-like structure for easier data manipulation.
In machine learning, the inputs are called "features," and the output we want to predict is the "target" variable. In our case, the house area is the feature, and the house price is the target.
```python
# Extract features and target variable
X = df['Area'].values.reshape(-1, 1)  # Feature
y = df['Price'].values  # Target

# Output example features and target values
print(f"Features (X): {X[:5]}")
# Features (X): [[1623.62]
# [3352.14]
# [2695.98]
# [2295.98]
# [ 968.06]]
print(f"Target (y): {y[:5]}")
# Target (y): [376900.25 712953.4 591490.38 459505.87 238119.39]
```
We reshape the features to ensure they're in the correct format for our model: reshape(-1, 1) converts X into a 2D array with a single column, which is the shape Scikit-Learn expects for a feature matrix.
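If the reshape step feels abstract, here is a small standalone demonstration of what reshape(-1, 1) does to a 1D array:

```python
import numpy as np

a = np.array([1623.62, 3352.14, 2695.98])
print(a.shape)           # (3,)   -- a flat 1D array

a_2d = a.reshape(-1, 1)  # -1 tells NumPy to infer the number of rows
print(a_2d.shape)        # (3, 1) -- 3 rows, 1 column
print(a_2d)
# [[1623.62]
#  [3352.14]
#  [2695.98]]
```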
Now it's time to use the Scikit-Learn library to initialize and train our linear regression model.
```python
from sklearn.linear_model import LinearRegression

# Initialize the model
model = LinearRegression()

# Train the model
model.fit(X, y)

# Output fitted model information
print(f"Model has been trained: Intercept = {model.intercept_:.2f}, Coefficient = {model.coef_[0]:.2f}")
# Model has been trained: Intercept = 57293.24, Coefficient = 196.17
```
We first import the LinearRegression class from Scikit-Learn. Then we create an instance of the model and train it using the fit method, which takes in the features X and the target y.
After training, we can check the model's coefficients to understand the relationship it found.
```python
# Print model coefficients
print(f"Intercept: {model.intercept_:.2f}, Coefficient: {model.coef_[0]:.2f}")
# Intercept: 57293.24, Coefficient: 196.17
```
The intercept is the point where the line crosses the y-axis (the predicted price when the area is zero), and the coefficient is the slope of the line (how much the price changes with one additional unit of area). In simpler terms, the intercept is the base price, and the coefficient represents the rate per square foot.
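As a quick sanity check (a small sketch using the intercept and coefficient printed above), we can reproduce a prediction by hand:

```python
# Reconstruct a prediction manually: price = intercept + coefficient * area
area_sqft = 2000
estimated_price = model.intercept_ + model.coef_[0] * area_sqft
print(f"Estimated price for {area_sqft} sq ft: {estimated_price:.2f}")
# Approximately 449633, i.e. intercept + coefficient * 2000
```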
Let's plot the data and the fitted line to visualize how well our model fits the data.
```python
import matplotlib.pyplot as plt

# Plot the data points
plt.scatter(X, y / 1000, alpha=0.5, color='blue', label='Data points')

# Plot the regression line
plt.plot(X, model.predict(X) / 1000, color='red', label='Regression line')

plt.xlabel('Area (sq ft)')
plt.ylabel('Price, thousands of dollars')
plt.title('Linear Regression: House Price vs. Area')
plt.legend()
plt.grid()
plt.show()
```
Using Matplotlib, we plot the original data points and the regression line. Note that we use the predict method to draw the line; we will discuss this method in detail in the next lesson. In short, it takes our features and predicts the target variable for each one using the fitted best-fit line.
The blue dots represent the actual data points, and the red line represents our fitted linear regression model:
So far, we've worked with a single feature, the house area. However, linear regression can also handle multiple features.
For example, suppose we have additional features such as the number of bedrooms and the age of the house. Our new dataset might look like this:
```
   price    area   age  num_bedrooms
0  1000.00  500.00  3.00  3.00
1  2500.00  700.00  3.00  3.00
2  3000.00  700.00  2.00  2.00
3  4500.00  800.00  5.00  3.00
```
Here, X consists of multiple columns representing different features. The model will now learn a coefficient for each feature, indicating how each one impacts the target variable. Though we can't visualize the four-dimensional data, the principle remains the same: linear regression finds the best-fit line for the given data!
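To illustrate, here is a minimal sketch (using the small table above and a hypothetical DataFrame named df_multi) of fitting a model with multiple features; each column gets its own coefficient:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The small example dataset from the table above
df_multi = pd.DataFrame({
    'price':        [1000.00, 2500.00, 3000.00, 4500.00],
    'area':         [500.00, 700.00, 700.00, 800.00],
    'age':          [3.00, 3.00, 2.00, 5.00],
    'num_bedrooms': [3.00, 3.00, 2.00, 3.00],
})

X_multi = df_multi[['area', 'age', 'num_bedrooms']].values  # three feature columns
y_multi = df_multi['price'].values                          # target

multi_model = LinearRegression().fit(X_multi, y_multi)
print(multi_model.intercept_)
print(multi_model.coef_)  # one coefficient per feature: area, age, num_bedrooms
```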
In this lesson, we've covered:
- Understanding linear regression and its purpose.
- Generating synthetic data to simulate real-world house areas and prices.
- Creating a DataFrame to organize our data.
- Extracting features and target variables for model training.
- Initializing and training a linear regression model using Scikit-Learn.
- Interpreting the model's output.
Now that you understand the theory and steps involved in training a linear regression model, it's time to put this knowledge into practice. In the practice session, you will get hands-on experience with training a model and making predictions based on new data. Get ready to apply what you've learned and build your own linear regression model!