Hello and welcome! In today's lesson, we will learn how to make predictions using a trained Linear Regression model and evaluate the model's performance using the Mean Squared Error (MSE) metric. We will use the diamonds dataset to demonstrate this process.
Before we dive into making predictions, let's briefly recap the steps we took to prepare and train our Linear Regression model.
First, we loaded the diamonds dataset using `seaborn` and prepared it by converting categorical variables into dummy variables for numerical compatibility. Next, we selected our features and target variable, and split the data into training and testing sets so that we could check how well the model generalizes to unseen data. Finally, we created and trained our Linear Regression model:
```python
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Load the diamonds dataset
diamonds = sns.load_dataset('diamonds')

# Convert categorical variables to dummy/indicator variables
diamonds = pd.get_dummies(diamonds, drop_first=True)

# Selecting features and target variable
X = diamonds.drop('price', axis=1)
y = diamonds['price']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating and training the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
```
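To make the dummy-variable step more concrete, here is a minimal sketch on a tiny, made-up DataFrame. The `toy` data and its values are assumptions for illustration only and are not taken from the diamonds dataset. It shows how `pd.get_dummies` with `drop_first=True` replaces a text column with indicator columns and drops the first category.

```python
import pandas as pd

# Hypothetical toy data, not taken from the diamonds dataset
toy = pd.DataFrame({
    "carat": [0.3, 0.7, 1.1],
    "cut": ["Ideal", "Premium", "Good"],
})

# Each category of "cut" becomes its own indicator column;
# drop_first=True drops the first category in sorted order ("Good")
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
# ['carat', 'cut_Ideal', 'cut_Premium']
```

A row where both `cut_Ideal` and `cut_Premium` are false corresponds to the dropped category, so no information is lost while one redundant column is avoided.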
With the trained model ready, we can now move on to making predictions.
To make predictions with our trained model, we use the `predict` method provided by the `LinearRegression` class. This method generates predicted values for our test data.
Here’s how to use the `predict` method and display the first 10 predictions:
```python
# Making predictions on the test data
predictions = model.predict(X_test)
print(predictions[:10])  # Display first 10 predictions for brevity
```
The output of the above code will be:
```text
[ 711.88577262 3191.72583727 1947.2464112  2077.29062598 9878.99820896
 3932.58482532 2372.62585284 2380.08706701 2844.11827559 6199.23891652]
```
This output represents the first ten predicted prices of diamonds based on the model. Each number corresponds to the model's prediction of a diamond's price within the test dataset.
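As a rough illustration of what `predict` does for a linear model, the sketch below fits `LinearRegression` on a small synthetic dataset and checks that `predict` matches the linear formula built from the fitted `coef_` and `intercept_` attributes. The toy arrays (`X_toy`, `y_toy`, `X_new`) are made up for this example and are unrelated to the diamonds data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data that follows y = 3*x1 + 2*x2 + 5 exactly
X_toy = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y_toy = 3 * X_toy[:, 0] + 2 * X_toy[:, 1] + 5

toy_model = LinearRegression().fit(X_toy, y_toy)

# New, unseen inputs
X_new = np.array([[6.0, 5.0], [0.0, 0.0]])

# predict() applies the learned weights and intercept to each row
via_predict = toy_model.predict(X_new)
by_hand = X_new @ toy_model.coef_ + toy_model.intercept_

print(np.allclose(via_predict, by_hand))  # True
```

Each predicted diamond price is produced the same way: a weighted sum of that diamond's feature values plus the intercept.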
By generating predictions, we can now compare these predicted values to the actual values in our test set to evaluate the model's performance.
The Mean Squared Error (MSE) is a metric that measures the average of the squares of the errors, that is, the average squared difference between the predicted values and the actual values. In mathematical terms, MSE is defined as:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $n$ is the number of observations, $y_i$ are the actual values, and $\hat{y}_i$ are the predicted values. A lower MSE indicates a better fit of the model to the data, while a higher MSE indicates larger errors in the predictions. However, MSE is sensitive to outliers: because each error is squared, a single large error can disproportionately inflate the MSE value.
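The definition can be worked through by hand on a few numbers. The sketch below uses made-up actual and predicted prices (hypothetical values, not from the diamonds dataset) to compute the MSE directly with NumPy, and then changes one prediction to show how a single large error dominates the result.

```python
import numpy as np

# Hypothetical actual and predicted prices
y_true = np.array([500.0, 1200.0, 3000.0, 4500.0])
y_pred = np.array([550.0, 1100.0, 3100.0, 4400.0])

# Errors: [-50, 100, -100, 100]; squared: [2500, 10000, 10000, 10000]
errors = y_true - y_pred
mse = np.mean(errors ** 2)
print(mse)  # 8125.0

# Same predictions, except the last one is now off by 2000
y_pred_outlier = np.array([550.0, 1100.0, 3100.0, 6500.0])
mse_outlier = np.mean((y_true - y_pred_outlier) ** 2)
print(mse_outlier)  # 1005625.0
```

Only one of the four predictions changed, yet the MSE grew by more than a factor of 100, which is the outlier sensitivity described above.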
To calculate the MSE in code, we use the `mean_squared_error` function from the `sklearn.metrics` module. Here’s the code to perform this calculation and print the result:
```python
from sklearn.metrics import mean_squared_error

# Calculating the Mean Squared Error
mse = mean_squared_error(y_test, predictions)
print(f'Mean Squared Error: {mse}')
```
The output of the above calculation will be:
```text
Mean Squared Error: 1288705.4778516747
```
This function compares the predicted values with the actual values in the test set and computes the MSE, giving us a sense of the model's accuracy. More precisely, the average squared difference between the model's predictions and the actual diamond prices is approximately 1,288,705. Note that this value is expressed in squared price units, so it cannot be compared directly with the prices themselves.
This shows that the model's predictions still deviate noticeably from the actual prices, which leaves room for improving the model's accuracy.
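One common way to make the MSE easier to interpret is to take its square root, often called the Root Mean Squared Error (RMSE), which is expressed in the same units as the target. This step is not part of the lesson's code; the sketch below simply applies it to the MSE value printed above.

```python
import math

# MSE value reported in the lesson's output
mse = 1288705.4778516747

# The square root brings the error back to price units
rmse = math.sqrt(mse)
print(f"Root Mean Squared Error: {rmse:.2f}")
# Root Mean Squared Error: 1135.21
```

Read this way, the model's predictions are typically off by roughly 1,135 in price units, with larger errors weighted more heavily than smaller ones.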
In this lesson, we learned how to make predictions using a trained Linear Regression model and evaluated the model’s performance using Mean Squared Error (MSE). Understanding prediction and evaluation is crucial for making informed decisions based on model outputs. Keep going, and happy coding!