Recall of the Linear Regression Basics

Lesson 1

Lesson Introduction

This lesson provides a quick refresher on the core concepts of linear regression, focusing on key steps and implementation in Python using sklearn.

By the end of this lesson, you'll be ready to load datasets, split them, create and train a linear regression model, make predictions, and evaluate the model.

Loading Data

We'll start by loading the diabetes dataset from sklearn. This dataset contains ten baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements), which were obtained for each of 442 diabetes patients. The target is a quantitative measure of disease progression one year after baseline.

Python
import numpy as np
from sklearn import datasets

# Load the diabetes dataset
diabetes = datasets.load_diabetes()
X = diabetes.data  # Features
y = diabetes.target  # Target

print("Features:\n", X[:2])
print("Target:\n", y[:2])

Note that we access the features and the target of this dataset through its .data and .target attributes.

This code prints out the first two rows of the dataset, so we can observe its structure:

Plain text
Features:
 [[ 0.03807591  0.05068012  0.06169621  0.02187239 -0.0442235  -0.03482076
  -0.04340085 -0.00259226  0.01990749 -0.01764613]
 [-0.00188202 -0.04464164 -0.05147406 -0.02632753 -0.00844872 -0.01916334
   0.07441156 -0.03949338 -0.06833155 -0.09220405]]
Target:
 [151.  75.]

There is also a shortcut for loading X and y:

Python
X, y = datasets.load_diabetes(return_X_y=True)

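Passing return_X_y=True makes load_diabetes return the features and the target as an (X, y) tuple instead of a single dataset object, so the two are already separated when loading. Use whichever approach you find more comfortable.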
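
To make the difference between the two loading styles concrete, here is a minimal sketch that inspects the default dataset object and checks that the shortcut yields the same arrays. The names bunch, X_short, y_short, and same are illustrative and not part of the lesson's code.

```python
import numpy as np
from sklearn import datasets

# Default call: a dataset object exposing .data, .target, and .feature_names
bunch = datasets.load_diabetes()
print(bunch.data.shape)     # (442, 10)
print(bunch.target.shape)   # (442,)
print(bunch.feature_names)  # ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

# Shortcut call: a plain (X, y) tuple
X_short, y_short = datasets.load_diabetes(return_X_y=True)

# Both styles should give identical arrays
same = np.array_equal(bunch.data, X_short) and np.array_equal(bunch.target, y_short)
print(same)
```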

Splitting the Dataset

Next, we'll split our data into training and testing sets, as we did before. As a reminder, we use the train_test_split function for this.

Python
from sklearn.model_selection import train_test_split

# Split dataset into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)

Output:

Plain text
Training set size: (353, 10)
Testing set size: (89, 10)

The size of the test set, test_size, is set to 0.2, which is 20% of the data. It is common to set the test set size to 20-30%.
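
As a small sketch of what the two arguments do, the snippet below splits the data twice with the same random_state and compares the results, then checks the test set size. The names X_test_a and X_test_b are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)

# The same random_state gives the same shuffle, so both splits are identical
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_test_a, X_test_b))  # True

# A float test_size is a fraction of the rows: 0.2 * 442 = 88.4, rounded up to 89
print(len(X), len(X_test_a))  # 442 89
```

Fixing random_state keeps results reproducible between runs, which makes it easier to compare models on the same split.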

Creating the Model

Let's create a Linear Regression model and train it:

Python
from sklearn.linear_model import LinearRegression

# Create a Linear Regression model
model = LinearRegression()
# Train the model
model.fit(X_train, y_train)
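
Training estimates one coefficient per feature plus an intercept. As a minimal, self-contained sketch (it repeats the loading and splitting steps so it runs on its own), the fitted parameters can be inspected through the coef_ and intercept_ attributes:

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# One learned weight per feature, plus a single bias term
print(model.coef_.shape)  # (10,)
print(model.intercept_)
```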

Making Predictions

Using the trained model, let's make predictions on the test set:

Python
# Make predictions on the test set
y_pred = model.predict(X_test)
print(y_pred[:5])  # [139.5475584  179.51720835 134.03875572 291.41702925 123.78965872]

We print out the first 5 predictions to observe their values.
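
To show what predict does for a linear model, the sketch below recomputes the predictions by hand as a weighted sum of the features plus the intercept. The name manual is illustrative, and the snippet repeats the earlier steps so it runs on its own.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# A linear model predicts X @ coefficients + intercept
manual = X_test @ model.coef_ + model.intercept_
print(np.allclose(y_pred, manual))  # True
```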

Now we can evaluate the model's performance with a metric. Here we use the Mean Squared Error (MSE), which is the average of the squared differences between the true and the predicted values:

Python
from sklearn.metrics import mean_squared_error

# Evaluate the model using Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error: %.2f" % mse)

Output:

Plain text
Mean Squared Error: 2900.13
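
Since MSE is just the mean of the squared errors, it can be checked by hand with NumPy. The sketch below does that and also takes the square root (often called RMSE), which puts the error back in the units of the target. The names mse_manual and rmse are illustrative, and the snippet repeats the earlier steps so it runs on its own.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Library result versus the definition: mean of squared differences
mse = mean_squared_error(y_test, y_pred)
mse_manual = np.mean((y_test - y_pred) ** 2)
print(np.isclose(mse, mse_manual))  # True

# Square root of MSE, expressed in the same units as the target
rmse = np.sqrt(mse)
print("RMSE: %.2f" % rmse)
```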

Lesson Summary

You've refreshed your knowledge on:

Loading datasets
Splitting data into training and testing sets
Creating and training a linear regression model
Making predictions
Evaluating the model using MSE

Now, you're prepared for the practice session to reinforce these concepts. Let's dive in!
