Hi there! Today, we're going to learn how to apply Linear Regression to a real dataset: the California Housing Dataset. Working with real data shows how machine learning solves practical problems. By the end of this lesson, you'll know how to train a Linear Regression model on a real dataset and interpret its results.
Before diving into the code, let's understand the dataset we'll be working with. The California Housing Dataset is based on data from the 1990 California census. It contains information about various factors affecting housing prices in different districts of California.
Here's a quick overview of the columns in the dataset:
- `MedInc`: Median income in block group
- `HouseAge`: Median house age in block group
- `AveRooms`: Average number of rooms per household
- `AveBedrms`: Average number of bedrooms per household
- `Population`: Block group population
- `AveOccup`: Average household size
- `Latitude`: Block group latitude
- `Longitude`: Block group longitude
- `MedHouseVal`: Median house value for California districts (this is our target variable)
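By the way, you don't have to memorize these descriptions: scikit-learn bundles the official documentation with the dataset itself, and you can print it once the data is loaded. Here's a minimal sketch (the exact wording of the printed text depends on your scikit-learn version):

```python
from sklearn.datasets import fetch_california_housing

# The loader returns an object whose DESCR attribute holds the
# official dataset documentation, including these column notes
california = fetch_california_housing(as_frame=True)
print(california.DESCR)
```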
First, let's load our data. Think of this step as getting all the ingredients ready before cooking. Here's the code to load the dataset:
```python
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset as a Pandas DataFrame
california = fetch_california_housing(as_frame=True)
df = california.frame

print(df.head())  # Display the first five rows
# Output:
#    MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal
# 0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23        4.526
# 1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22        3.585
# 2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24        3.521
# 3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25        3.413
# 4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25        3.422
```
We used the `fetch_california_housing` function to load the dataset and convert it to a Pandas `DataFrame` for easier handling.
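Before selecting features, it can help to take a quick look at the data as a whole. This is just a routine sanity check with standard Pandas calls, continuing from the code above:

```python
# Shape: number of rows (districts) and columns
print(df.shape)  # (20640, 9)

# Summary statistics (mean, std, min, max, quartiles) per column
print(df.describe())

# Column dtypes and non-null counts
df.info()
```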
Now, let's select our features and target. In the California Housing Dataset, we'll use all features except for the target column (`MedHouseVal`).
Here's the code:
```python
# Drop rows with missing values
df.dropna(inplace=True)

# Select the features (all except the target column)
X = df.drop(columns=['MedHouseVal'])

# Select the target column
y = df['MedHouseVal']

print(X.head())
# Output:
#    MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
# 0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23
# 1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22
# 2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24
# 3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25
# 4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25

print(y.head())
# Output:
# 0    4.526
# 1    3.585
# 2    3.521
# 3    3.413
# 4    3.422
# Name: MedHouseVal, dtype: float64
```
We drop rows with missing values and select all features except the target.
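As it happens, the California Housing Dataset ships with no missing values, so `dropna()` removes nothing here; still, it's a good habit with real data, and you can verify it yourself:

```python
# Count missing values per column; every count is 0 for this dataset
print(df.isna().sum())
```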
Next, we create and train our Linear Regression model. Think of it as teaching a kid to ride a bike: you show them a few times, and then they get the hang of it.
Here's the code:
```python
from sklearn.linear_model import LinearRegression

# Initialize the linear regression model
model = LinearRegression()

# Train the model
model.fit(X, y)
```
The `LinearRegression()` constructor initializes the model, and `model.fit(X, y)` trains it using our data.
Once our model is trained, it's ready to make predictions. This is like the kid finally riding the bike on their own.
Here's how you can make predictions:
```python
# Make predictions
y_pred = model.predict(X)
print(y_pred[:5])  # Display the first five predictions
# Output:
# [4.13642691 3.62014328 3.39896532 3.41478061 3.92649121]
```
The `model.predict(X)` method uses the trained model to predict house prices based on the feature values.
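A quick way to eyeball these predictions is to line them up against the actual values. Here's a small sketch, continuing from the code above (the numbers match the outputs we've already seen):

```python
# Put the first five actual and predicted values side by side
comparison = pd.DataFrame({'Actual': y[:5], 'Predicted': y_pred[:5]})
print(comparison)
# Output:
#    Actual  Predicted
# 0   4.526   4.136427
# 1   3.585   3.620143
# 2   3.521   3.398965
# 3   3.413   3.414781
# 4   3.422   3.926491
```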
It's important to understand how the model makes predictions. In a Linear Regression model, we have an intercept and a coefficient for each feature. Think of the intercept as a starting point and the coefficients as slopes.
Here's the code to display them:
```python
# Display the intercept
print(f"Intercept: {model.intercept_}")

# Display the coefficients
coefficients = pd.DataFrame(model.coef_, X.columns, columns=['Coefficient'])
print(coefficients)
# Output:
# Intercept: -36.94192020749747
#             Coefficient
# MedInc         0.447808
# HouseAge       0.011540
# AveRooms       0.080079
# AveBedrms     -0.144893
# Population    -0.000046
# AveOccup      -0.004605
# Latitude      -0.426464
# Longitude     -0.430478
```
The `model.intercept_` attribute gives us the intercept, and `model.coef_` gives us the coefficients for each feature.
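To convince yourself that these numbers really are the whole model, you can reconstruct a prediction by hand: multiply each feature value by its coefficient, sum the results, and add the intercept. This minimal sketch should reproduce the first prediction we saw earlier:

```python
import numpy as np

# Manually compute the prediction for the first district:
# intercept + sum(coefficient_i * feature_value_i)
first_row = X.iloc[0]
manual_prediction = model.intercept_ + np.dot(model.coef_, first_row)

print(manual_prediction)           # ~4.1364, matching y_pred[0]
print(model.predict(X.iloc[[0]]))  # [4.13642691]
```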
Finally, we'll calculate the Mean Squared Error (MSE) to evaluate how well our model is doing. Think of it as checking if the kid can ride the bike without falling.
Here's the code to calculate the MSE and draw conclusions:
```python
from sklearn.metrics import mean_squared_error

# Calculate the Mean Squared Error
mse = mean_squared_error(y, y_pred)
print(f"Mean Squared Error: {mse:.4f}")  # Mean Squared Error: 0.5308
```
The `mean_squared_error` function computes the MSE, which tells us how close our predictions are to the actual values. A lower MSE indicates a better fit.
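One caveat: we computed the MSE on the same data the model was trained on, which tends to give an optimistic picture of performance. A common refinement, sketched below with scikit-learn's `train_test_split`, is to hold out a test set and measure the error on data the model has never seen (the exact MSE you get will differ somewhat from the one above):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train on the training split only
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out test split
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"Test MSE: {test_mse:.4f}")
```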
Great job! Today, we learned how to apply Linear Regression to a real dataset. We loaded the California Housing Dataset, selected features and a target, trained a model, made predictions, and evaluated the results. Understanding how to work with real datasets is a key skill in machine learning.
Now it's your turn! Move on to the practice exercises where you'll apply what you've learned to another real dataset. You'll load data, train a model, make predictions, and visualize the results. Happy coding!