Fitting a Linear Regression Model to the Housing Dataset with Sklearn

Introduction and Basics of Linear Regression Model Fitting

Welcome back! Today we put predictive modeling into practice. This lesson applies Linear Regression to the California Housing Dataset in Python to make predictions from real-world data. Instead of implementing linear regression from scratch, we will use the sklearn library to compute the coefficients efficiently and matplotlib to plot the data and the regression line. So, without further ado, let's dive in!

Data Loading and Preparation

Our journey begins by loading the California Housing Dataset. Here's how we do it with Python:

Python
from sklearn.datasets import fetch_california_housing
import matplotlib.pyplot as plt

# Fetching the dataset
housing = fetch_california_housing()
# Selecting the Median Income feature, index 0 represents Median Income
X = housing.data[:, 0]  # Extracting the first column
X = X.reshape(-1, 1)  # Reshape for a single feature (sklearn expects 2D array)
Y = housing.target

In this preparation phase, we import the libraries we need and load the dataset with fetch_california_housing. The feature we focus on is "Median Income", the first column (index 0) of housing.data. Selecting that column gives a 1D array, so we reshape it into a 2D array with a single column, because sklearn expects features with shape (n_samples, n_features). The target Y, the median house value, stays 1D. With the independent variable selected and reshaped, our data is ready for model fitting.
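To make the reshape step concrete, here is a minimal sketch on a toy array. The values are made up for illustration and merely stand in for the Median Income column; the names column and X_toy are not part of the lesson's code.

```python
import numpy as np

# Toy stand-in for one feature column (values are made up for illustration)
column = np.array([8.3, 7.2, 5.6, 3.8])
print(column.shape)  # (4,): 1D, four values and no feature axis

# -1 lets NumPy infer the number of rows; 1 is the number of features
X_toy = column.reshape(-1, 1)
print(X_toy.shape)  # (4, 1): 2D, four samples with one feature each
```

The same call applied to housing.data[:, 0] yields one row per block group and a single feature column.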

Fitting the Model

With our data prepared, we can now fit our linear regression model:

Python
from sklearn.linear_model import LinearRegression

# Creating and training the model
model = LinearRegression()
model.fit(X, Y)

Here, we initialize the LinearRegression model and fit it to our data with model.fit(X, Y). The fit method trains the model by finding the intercept and coefficient that best predict our target values from the given feature. LinearRegression uses ordinary least squares: it chooses the parameters that minimize the sum of squared differences between actual and predicted values. After fitting, the learned parameters are available as model.coef_ and model.intercept_. In short, model.fit lets sklearn handle the underlying mathematics, so the model is ready to make predictions without manual calculation.
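For a single feature, the least-squares fit can be written out by hand. The sketch below uses plain Python and made-up toy data; fit_simple_ols is a hypothetical helper written for illustration, and sklearn's internal solver may work differently even though it should arrive at the same slope and intercept.

```python
# Toy data lying exactly on y = 1 + 2x (made up for illustration)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

def fit_simple_ols(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of cross-deviations / sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_simple_ols(xs, ys)
print(slope, intercept)
```

On the housing data, the corresponding values would be read from model.coef_[0] and model.intercept_ after fitting.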

Data Visualization and Regression Line

Let's visualize how well our model fits the data:

Python
# Predictions for the dataset
Y_pred = model.predict(X)

# Plotting actual data points
plt.scatter(X, Y, color='blue')

# Plotting the regression line
plt.plot(X, Y_pred, color='red')

plt.title("Linear Regression on California Housing Dataset")
plt.xlabel('Median Income (tens of thousands of dollars)')
plt.ylabel('Median House Value (hundreds of thousands of dollars)')
plt.grid(True)
plt.show()

This code plots the actual entries from our dataset as blue dots and the regression line from our model in red. The model.predict(X) call plays the pivotal role here: it applies the linear regression formula, using the coefficients learned during fitting, to the X values (median income). The result is a predicted Y value (median house value) for each input, which lets us see the linear relationship the model has learned.

Plotting the regression line alongside the actual data points gives a clear view of the relationship between median income and house values, and of how closely a single straight line follows the data.
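As a check on what model.predict computes, the sketch below fits a model to made-up toy data and compares its predictions with the line formula evaluated by hand. The names X_toy, y_toy, toy_model and manual are illustrative and not part of the lesson's code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (made up); stands in for median income and median house value
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([1.2, 1.9, 3.2, 3.8])

toy_model = LinearRegression().fit(X_toy, y_toy)

# Evaluate intercept + slope * x by hand for every sample
manual = toy_model.intercept_ + toy_model.coef_[0] * X_toy[:, 0]

# Compare the hand-computed values with predict()
print(np.allclose(manual, toy_model.predict(X_toy)))
```

With one feature, predict is just this straight-line formula applied to each row of the input.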

Making Predictions

Now, we demonstrate the power of our model:

Python
# Note: sklearn requires a 2D input, so we wrap the single value in double brackets
x_new = [[8]]  # Median income in tens of thousands of dollars, i.e. $80000.00
y_new_pred = model.predict(x_new)
# The target is in hundreds of thousands of dollars, hence the factor of 100000
print(f"For a median income of ${x_new[0][0] * 10000:.2f}, the projected median house value is ${y_new_pred[0] * 100000:.2f}")
# Output: For a median income of $80000.00, the projected median house value is $379436.37

By applying our model to predict the median house value for a specific median income, we observe the practical utility of Linear Regression. This showcases how median income levels can affect housing price predictions, providing us with valuable insights into housing market dynamics.
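The scaling in the print statement above is easy to get wrong, so here is a small sketch that isolates it. It assumes, as the lesson's code does, that median income is expressed in tens of thousands of dollars and the target in hundreds of thousands. The helper names are hypothetical, and the value 3.7943637 is simply the lesson's printed output divided by 100000.

```python
# Hypothetical helpers; the names are illustrative and not part of sklearn
INCOME_UNIT = 10_000   # median income is in tens of thousands of dollars
VALUE_UNIT = 100_000   # target is in hundreds of thousands of dollars

def income_to_feature(dollars):
    """Convert an income in dollars to a 2D input suitable for model.predict."""
    return [[dollars / INCOME_UNIT]]

def prediction_to_dollars(pred):
    """Convert a model output back to dollars."""
    return pred * VALUE_UNIT

x_new = income_to_feature(80_000)
print(x_new)  # [[8.0]]

# 3.7943637 corresponds to the prediction printed in the lesson
print(f"${prediction_to_dollars(3.7943637):.2f}")
```

Keeping the conversions in one place avoids mixing raw dollars with the dataset's scaled units when calling model.predict.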

Next Steps: Evaluating Your Model

Fitting a model is only one piece of the puzzle; evaluating its performance is just as crucial. We will cover evaluation techniques in upcoming lessons to check whether our model's predictions are accurate and reliable enough for real-world application.

Lesson Summary

Kudos! We've navigated through the steps of loading data, fitting a Linear Regression model with sklearn, visualizing the model against our data, and making predictions on the California Housing Dataset. This lesson marks an important milestone in our journey through predictive modeling, arming us with the knowledge to fit, visualize, and apply linear regression models in practice. Stay tuned for more engaging sessions ahead!
