Lesson 5

Hi there! Today, we're going to learn how to apply **Linear Regression** to a real dataset. Working with real data shows us how *machine learning* solves real problems. We'll use the **California Housing Dataset**. By the end of this lesson, you'll know how to use `Linear Regression` on a real dataset and understand the results.

Before diving into the code, let's understand the dataset we'll be working with. The **California Housing Dataset** is based on data from the 1990 California census. It contains information about various factors affecting housing prices in different districts of California.

Here's a quick overview of the columns in the dataset:

- `MedInc`: Median income in block group
- `HouseAge`: Median house age in block group
- `AveRooms`: Average number of rooms per household
- `AveBedrms`: Average number of bedrooms per household
- `Population`: Block group population
- `AveOccup`: Average household size
- `Latitude`: Block group latitude
- `Longitude`: Block group longitude
- `MedHouseVal`: Median house value for California districts (this is our target variable)

First, let's load our data. Think of this step as getting all the ingredients ready before cooking. Here's the code to load the dataset:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset
california = fetch_california_housing(as_frame=True)
df = california.frame

print(df.head())  # Display the first five rows
# Output:
#    MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal
# 0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23        4.526
# 1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22        3.585
# 2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24        3.521
# 3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25        3.413
# 4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25        3.422
```

We used the `fetch_california_housing` function to load the dataset and convert it to a Pandas `DataFrame` for easier handling.
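If you'd like to sanity-check what was loaded before going further, here's a quick sketch (it re-loads the data so it runs on its own; the checks shown are just one reasonable starting point):

```python
from sklearn.datasets import fetch_california_housing

# Re-load the dataset so this snippet stands on its own
california = fetch_california_housing(as_frame=True)
df = california.frame

# Quick sanity checks on what we just loaded:
# 20,640 districts, 8 feature columns plus the target column
print(df.shape)   # (20640, 9)
print(df.dtypes)  # every column is numeric (float64)
```

Checking the shape and dtypes up front is a cheap way to catch loading problems before they turn into confusing modeling errors later.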

Now, let's select our features and target. In the California Housing Dataset, we'll use all features except for the target column (`MedHouseVal`).

Here's the code:

```python
# Drop rows with missing values
df.dropna(inplace=True)

# Select the features (all except the target column)
X = df.drop(columns=['MedHouseVal'])

# Select the target column
y = df['MedHouseVal']

print(X.head())
# Output:
#    MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
# 0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23
# 1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22
# 2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24
# 3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25
# 4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25

print(y.head())
# Output:
# 0    4.526
# 1    3.585
# 2    3.521
# 3    3.413
# 4    3.422
# Name: MedHouseVal, dtype: float64
```

We drop rows with missing values and select all features except the target. This particular dataset happens to have no missing values, so `dropna` removes nothing here, but it's a good habit when working with real data.
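You can verify that the `dropna` step is purely precautionary for this dataset with a small check (re-loading the data so the snippet runs standalone):

```python
from sklearn.datasets import fetch_california_housing

df = fetch_california_housing(as_frame=True).frame

# Count missing values across the whole DataFrame -- for this
# particular dataset the total is 0
print(df.isnull().sum().sum())  # 0

rows_before = len(df)
df.dropna(inplace=True)
print(len(df) == rows_before)  # True: no rows were dropped
```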

Next, we create and train our `Linear Regression` model. Think of it as teaching a kid to ride a bike: you show them a few times, and then they get the hang of it.

Here's the code:

```python
from sklearn.linear_model import LinearRegression

# Initialize the linear regression model
model = LinearRegression()

# Train the model
model.fit(X, y)
```

The `LinearRegression()` call initializes the model, and `model.fit(X, y)` trains it using our data.
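If you're curious what `fit` actually produced, the learned parameters are stored on the model object itself. A quick check (self-contained so you can run it directly):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression

df = fetch_california_housing(as_frame=True).frame
X = df.drop(columns=['MedHouseVal'])
y = df['MedHouseVal']

model = LinearRegression()
model.fit(X, y)

# After fitting, the model holds one coefficient per feature
# plus a single intercept
print(model.coef_.shape)             # (8,)
print(hasattr(model, 'intercept_'))  # True
```

We'll look at what these numbers mean a little later in the lesson.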

Once our model is trained, it's ready to make predictions. This is like the kid finally riding the bike on their own.

Here's how you can make predictions:

```python
# Make predictions
y_pred = model.predict(X)
print(y_pred[:5])  # Display the first five predictions
# Output:
# [4.13642691 3.62014328 3.39896532 3.41478061 3.92649121]
```

The `model.predict(X)` call uses the trained model to predict house prices based on the feature values.
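One way to get a feel for these numbers is to put the predictions next to the actual values. A small sketch (the `comparison` name is just for illustration):

```python
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression

df = fetch_california_housing(as_frame=True).frame
X = df.drop(columns=['MedHouseVal'])
y = df['MedHouseVal']

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# Actual vs. predicted median house values for the first five districts
comparison = pd.DataFrame({'Actual': y[:5], 'Predicted': y_pred[:5]})
comparison['Error'] = comparison['Actual'] - comparison['Predicted']
print(comparison)
```

Eyeballing a few errors like this is a useful habit before reaching for a summary metric like the MSE we compute below.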

It's important to understand how the model is making predictions. In a `Linear Regression` model, we have an intercept and a coefficient for each feature. Think of the intercept as a starting point and the coefficients as slopes.

Here's the code to display them:

```python
# Display the intercept
print(f"Intercept: {model.intercept_}")
# Output: Intercept: -36.94192020749747

# Display the coefficients
coefficients = pd.DataFrame(model.coef_, index=X.columns, columns=['Coefficient'])
print(coefficients)
# Output:
#             Coefficient
# MedInc         0.447808
# HouseAge       0.011540
# AveRooms       0.080079
# AveBedrms     -0.144893
# Population    -0.000046
# AveOccup      -0.004605
# Latitude      -0.426464
# Longitude     -0.430478
```

The `model.intercept_` attribute gives us the intercept, and `model.coef_` gives us the coefficients for each feature.
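To see that the intercept and coefficients really are the whole model, you can reproduce a prediction by hand: a linear regression prediction is just the intercept plus the sum of each coefficient times its feature value. A sketch:

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression

df = fetch_california_housing(as_frame=True).frame
X = df.drop(columns=['MedHouseVal'])
y = df['MedHouseVal']
model = LinearRegression().fit(X, y)

# prediction = intercept + sum(coef_i * feature_i)
first_row = X.iloc[0].to_numpy()
manual = model.intercept_ + np.dot(model.coef_, first_row)

print(manual)                          # computed by hand
print(model.predict(X.iloc[[0]])[0])   # same number, computed by sklearn
```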

Finally, we'll calculate the **Mean Squared Error (MSE)** to evaluate how well our model is doing. Think of it as checking if the kid can ride the bike without falling.

Here's the code to calculate MSE and make conclusions:

```python
from sklearn.metrics import mean_squared_error

# Calculate the Mean Squared Error
mse = mean_squared_error(y, y_pred)
print(f"Mean Squared Error: {mse:.4f}")  # Mean Squared Error: 0.5308
```

The `mean_squared_error` function computes the MSE, which tells us how close our predictions are to the actual values. A lower MSE indicates a better fit.
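Because `MedHouseVal` is expressed in units of $100,000, a related number that's often easier to interpret is the square root of the MSE (the RMSE), which is in the same units as the target. A sketch (self-contained, refitting the model from scratch):

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = fetch_california_housing(as_frame=True).frame
X = df.drop(columns=['MedHouseVal'])
y = df['MedHouseVal']
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# RMSE = sqrt(MSE), expressed in the target's own units ($100,000s)
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:.4f}")
```

With the MSE shown above, this comes out to roughly 0.73, i.e. a typical prediction error on the order of $73,000.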

Great job! Today, we learned how to apply `Linear Regression` to a real dataset. We loaded the California Housing Dataset, selected features and target, trained a model, made predictions, and evaluated the results. Understanding how to work with real datasets is a key skill in machine learning.

Now it's your turn! Move on to the practice exercises where you'll apply what you've learned to another real dataset. You'll load data, train a model, make predictions, and visualize the results. Happy coding!