Lesson 3
Making Predictions and Evaluating the Model
Topic Overview

Hello and welcome! In today's lesson, we will learn how to make predictions using a trained Linear Regression model and evaluate the model's performance using the Mean Squared Error (MSE) metric. We will use the diamonds dataset to demonstrate this process.

Recap of the Trained Model

Before we dive into making predictions, let's briefly recap the steps we took to prepare and train our Linear Regression model.

First, we loaded the diamonds dataset using seaborn and prepared it by converting categorical variables into dummy variables for numerical compatibility. Next, we selected our features and target variable, and split the data into training and testing sets to ensure our model would generalize well to unseen data. Finally, we created and trained our Linear Regression model:

Python
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Load the diamonds dataset
diamonds = sns.load_dataset('diamonds')

# Convert categorical variables to dummy/indicator variables
diamonds = pd.get_dummies(diamonds, drop_first=True)

# Selecting features and target variable
X = diamonds.drop('price', axis=1)
y = diamonds['price']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating and training the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

With the trained model ready, we can now move on to making predictions.

Making Predictions on Test Data

To make predictions with our trained model, we use the predict method provided by the LinearRegression class. This method will generate predicted values for our test data.

Here’s how to use the predict method and display the first 10 predictions:

Python
# Making predictions on the test data
predictions = model.predict(X_test)
print(predictions[:10])  # Display first 10 predictions for brevity

The output of the above code will be:

Plain text
[ 711.88577262 3191.72583727 1947.2464112  2077.29062598 9878.99820896
 3932.58482532 2372.62585284 2380.08706701 2844.11827559 6199.23891652]

This output represents the first ten predicted prices of diamonds based on the model. Each number corresponds to the model's prediction of a diamond's price within the test dataset.

By generating predictions, we can now compare these predicted values to the actual values in our test set to evaluate the model's performance.
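Before moving on to a formal metric, it can help to eyeball a few predictions against the actual prices. The following is a minimal sketch, assuming the y_test and predictions variables from the code above; the comparison DataFrame and its column names are purely illustrative.

Python
import pandas as pd

# Place the first ten actual and predicted prices side by side for a quick visual check
comparison = pd.DataFrame({
    'actual_price': y_test.values[:10],
    'predicted_price': predictions[:10]
})
print(comparison)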

Calculating and Understanding Mean Squared Error (MSE)

The Mean Squared Error (MSE) is a metric that measures the average of the squares of the errors; that is, the average squared difference between the predicted values and the actual values. In mathematical terms, MSE is defined as

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $n$ is the number of observations, $y_i$ are the actual values, and $\hat{y}_i$ are the predicted values. A lower MSE indicates a better fit of the model to the data, while a higher MSE indicates larger errors in the predictions. However, it is essential to keep in mind that MSE is sensitive to outliers: a large error in any single prediction can disproportionately inflate the MSE value.
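To connect the formula to code, here is a small sketch that computes the MSE by hand with NumPy, assuming the y_test and predictions variables from the previous steps; the result should match the value produced by scikit-learn's built-in function shown next.

Python
import numpy as np

# MSE by hand: average of the squared differences between actual and predicted values
residuals = y_test.values - predictions
mse_manual = np.mean(residuals ** 2)
print(f'Manually computed MSE: {mse_manual}')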

To calculate the MSE in code, we use the mean_squared_error function from the sklearn.metrics module. Here’s the code to perform this calculation and print the result:

Python
from sklearn.metrics import mean_squared_error

# Calculating the Mean Squared Error
mse = mean_squared_error(y_test, predictions)
print(f'Mean Squared Error: {mse}')

The output of the above calculation will be:

Plain text
Mean Squared Error: 1288705.4778516747

This function compares the predicted values with the actual values in the test set and computes the MSE, giving us a sense of the model's accuracy. More precisely, the average squared difference between the model's predictions and the actual diamond prices is approximately 1,288,705 (in squared price units).

This highlights the variability in the model's predictions relative to the actual prices and points to room for improving the model's accuracy.
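Since MSE is expressed in squared price units, a common follow-up (beyond what this lesson covers) is to take its square root to obtain the Root Mean Squared Error (RMSE), which is in the same units as the price itself. A minimal sketch, assuming the y_test and predictions variables from above:

Python
import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE: square root of MSE, expressed in the same units as the target (price)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(f'Root Mean Squared Error: {rmse}')

For the MSE reported above, this works out to roughly 1,135, suggesting the model's typical prediction error is on the order of a thousand dollars.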

Lesson Summary

In this lesson, we learned how to make predictions using a pre-trained Linear Regression model and evaluated the model’s performance using Mean Squared Error (MSE). Understanding prediction and evaluation is crucial in making informed decisions based on model outputs. Keep going, and happy coding!
