This lesson provides a quick refresher on the core concepts of linear regression, focusing on the key steps and their implementation in Python using `sklearn`.
By the end of this lesson, you'll be ready to load datasets, split them, create and train a linear regression model, make predictions, and evaluate the model.
We'll start by loading the diabetes dataset from `sklearn`. This dataset contains ten baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements), which were obtained for each of 442 diabetes patients. The target is a quantitative measure of disease progression one year after baseline.
```python
import numpy as np
from sklearn import datasets

# Load the diabetes dataset
diabetes = datasets.load_diabetes()
X = diabetes.data    # Features
y = diabetes.target  # Target

print("Features:\n", X[:2])
print("Target:\n", y[:2])
```
Note that we can access the features and target of this dataset by using the `.data` and `.target` attributes.
This code prints out the first two rows of the dataset, so we can observe its structure:
```text
Features:
 [[ 0.03807591  0.05068012  0.06169621  0.02187239 -0.0442235  -0.03482076
  -0.04340085 -0.00259226  0.01990749 -0.01764613]
 [-0.00188202 -0.04464164 -0.05147406 -0.02632753 -0.00844872 -0.01916334
   0.07441156 -0.03949338 -0.06833155 -0.09220405]]
Target:
 [151.  75.]
```
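The raw arrays carry no column labels, so it can help to look at the metadata on the loaded object as well. As a small sketch, the snippet below checks the array shapes and prints the `feature_names` attribute, which maps each column of `X` to one of the ten baseline variables.

```python
from sklearn import datasets

# Load the dataset object and pull out features and target
diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

# One row per patient, one column per baseline variable
print("X shape:", X.shape)
print("y shape:", y.shape)

# Column names, in the same order as the columns of X
print("Feature names:", diabetes.feature_names)
```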
There is also a shortcut for loading X and y:
```python
X, y = datasets.load_diabetes(return_X_y=True)
```
The `return_X_y=True` parameter makes `load_diabetes` return the features and target directly as an `(X, y)` tuple instead of a single dataset object. You can use whichever method you find more comfortable.
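To confirm that the two loading styles are interchangeable, a quick check like the following can be used. It is only a sanity check, comparing the arrays from both calls with `np.array_equal`.

```python
import numpy as np
from sklearn import datasets

# Style 1: load the dataset object and read its attributes
diabetes = datasets.load_diabetes()

# Style 2: unpack features and target directly
X, y = datasets.load_diabetes(return_X_y=True)

# Both styles should give identical arrays
same_features = np.array_equal(X, diabetes.data)
same_target = np.array_equal(y, diabetes.target)
print("Same features:", same_features)
print("Same target:", same_target)
```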
Next, we'll split our data into training and testing sets, like we did before. As a reminder, we use the `train_test_split` function for this.
```python
from sklearn.model_selection import train_test_split

# Split dataset into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)
```
Output:
```text
Training set size: (353, 10)
Testing set size: (89, 10)
```
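The size of the test set, `test_size`, is set to `0.2`, which is 20% of the data. It is common to set the test set size to 20-30%. Setting `random_state` to a fixed value makes the shuffle, and therefore the split, reproducible.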
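Since 20% of 442 is not a whole number, the split sizes involve rounding. The arithmetic below is a sketch that assumes the fractional test size is rounded up and the remainder goes to training, which is consistent with the 353/89 sizes printed above.

```python
import math

n_samples = 442   # patients in the diabetes dataset
test_size = 0.2   # fraction requested for the test set

# Assumption: the fractional test count (88.4) is rounded up
n_test = math.ceil(test_size * n_samples)

# Everything that is not in the test set goes to training
n_train = n_samples - n_test

print("Test samples:", n_test)
print("Train samples:", n_train)
```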
Let's create a Linear Regression model and train it:
```python
from sklearn.linear_model import LinearRegression

# Create a Linear Regression model
model = LinearRegression()
# Train the model
model.fit(X_train, y_train)
```
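Training a linear regression model amounts to learning one weight per feature plus an intercept. As an illustration, the sketch below repeats the same split and fit, then prints the learned `intercept_` and `coef_` values next to the dataset's `feature_names`.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same data and split as in the lesson
diabetes = datasets.load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# The fitted model stores one coefficient per feature and a single intercept
print(f"intercept: {model.intercept_:.2f}")
for name, coef in zip(diabetes.feature_names, model.coef_):
    print(f"{name:>4}: {coef:10.2f}")
```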
Using the trained model, let's make predictions on the test set:
```python
# Make predictions on the test set
y_pred = model.predict(X_test)
print(y_pred[:5])  # [139.5475584  179.51720835 134.03875572 291.41702925 123.78965872]
```
We print out the first 5 predictions to observe their values.
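For a linear model, each prediction is the dot product of the feature row with the learned coefficients, plus the intercept. The sketch below recomputes the predictions by hand from `coef_` and `intercept_` and compares them with the output of `predict`.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same data, split, and model as in the lesson
X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Predictions from the model
y_pred = model.predict(X_test)

# The same predictions computed manually: X @ w + b
y_manual = X_test @ model.coef_ + model.intercept_

print("Manual and model predictions match:", np.allclose(y_pred, y_manual))
```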
Now we can evaluate the model's performance with a metric. Here we use the Mean Squared Error (MSE), the average of the squared differences between the true and predicted values; lower is better:
```python
from sklearn.metrics import mean_squared_error

# Evaluate the model using Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
```
Output:
```text
Mean Squared Error: 2900.13
```
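To make the metric concrete, here is a minimal hand computation of MSE on three made-up values (the numbers are hypothetical and not taken from the diabetes dataset). It also takes the square root to get the RMSE, which is expressed in the same units as the target.

```python
import math

# Hypothetical true and predicted values, for illustration only
y_true = [3.0, 5.0, 7.0]
y_hat = [2.0, 5.0, 9.0]

# Squared difference for each pair: 1.0, 0.0, 4.0
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_hat)]

# MSE is the mean of the squared errors
mse = sum(squared_errors) / len(squared_errors)

# RMSE brings the error back to the target's units
rmse = math.sqrt(mse)

print(f"MSE: {mse:.2f}, RMSE: {rmse:.2f}")
```

Applied to the result above, an MSE of 2900.13 corresponds to an RMSE of roughly 54, meaning typical errors are on the order of 54 units of the disease progression measure.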
You've refreshed your knowledge on:
- Loading datasets
- Splitting data into training and testing sets
- Creating and training a linear regression model
- Making predictions
- Evaluating the model using MSE
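The steps above can be collected into one small function for experimentation. This is only a sketch: `evaluate_split` is a hypothetical helper name, not part of `sklearn`, and the seeds are arbitrary. It reruns the load, split, fit, predict, and evaluate steps for different `random_state` values so you can see how the MSE depends on which rows end up in the test set.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def evaluate_split(random_state, test_size=0.2):
    """Hypothetical helper: run the full workflow once and return the test MSE."""
    X, y = datasets.load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return mean_squared_error(y_test, y_pred)


# Arbitrary seeds, chosen only to illustrate split-to-split variation
scores = {seed: evaluate_split(seed) for seed in (0, 1, 42)}
for seed, mse in scores.items():
    print(f"random_state={seed}: MSE={mse:.2f}")
```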
Now, you're prepared for the practice session to reinforce these concepts. Let's dive in!