Making Predictions and Evaluating Performance

Lesson 4

Welcome back! You've already taken a huge step by building and evaluating a logistic regression model in the previous lesson. Now, let's move forward and see how to make predictions using this model and evaluate its performance.

What You'll Learn

In this lesson, you will:

Make predictions with your trained model using the test data.
Evaluate the performance of the model using a confusion matrix.
Understand what the evaluation results mean for your model.

By the end of this lesson, you will be able to:

Use the predict function in R to generate predictions from your logistic regression model.
Interpret a confusion matrix to understand the performance of your model.

You should already be familiar with most of the code below from previous units. The prediction step, added here, is the focus of this lesson:

R
# Load caret, which provides createDataPartition, preProcess, train, and confusionMatrix
library(caret)

# Load the mtcars dataset
data(mtcars)

# Set seed for reproducibility
set.seed(123)

# Convert categorical columns to factors
mtcars$am <- as.factor(mtcars$am)
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)

# Split the data into training (70%) and testing (30%) sets
trainIndex <- createDataPartition(mtcars$am, p = 0.7, list = FALSE, times = 1)
trainData <- mtcars[trainIndex, ]
testData <- mtcars[-trainIndex, ]

# Feature scaling (excluding factor columns), fitted on the training data only
numericColumns <- sapply(trainData, is.numeric)
preProcValues <- preProcess(trainData[, numericColumns], method = c("center", "scale"))
trainData[, numericColumns] <- predict(preProcValues, trainData[, numericColumns])
testData[, numericColumns] <- predict(preProcValues, testData[, numericColumns])

# Train a logistic regression model
model <- train(am ~ mpg, data = trainData, method = "glm", family = "binomial")

# Make predictions on the test set (returns predicted class labels)
predictions <- predict(model, testData)

# Evaluate the model by comparing predictions with the true labels
confusion <- confusionMatrix(predictions, testData$am)
print(confusion)

Output:


Confusion Matrix and Statistics

          Reference
Prediction 0 1
         0 4 0
         1 1 3

               Accuracy : 0.875
                 95% CI : (0.4735, 0.9968)
    No Information Rate : 0.625
    P-Value [Acc > NIR] : 0.135

                  Kappa : 0.75

 Mcnemar's Test P-Value : 1.000

            Sensitivity : 0.800
            Specificity : 1.000
         Pos Pred Value : 1.000
         Neg Pred Value : 0.750
             Prevalence : 0.625
         Detection Rate : 0.500
   Detection Prevalence : 0.500
      Balanced Accuracy : 0.900

       'Positive' Class : 0
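
To make the prediction step less of a black box, here is a minimal sketch that uses base R's glm instead of caret's train. It is not the lesson's pipeline: it fits on all of mtcars with no train/test split and no scaling, and the 0.5 cutoff is an assumption chosen for illustration.

```r
# Sketch: turning logistic regression probabilities into class labels (base R only)
data(mtcars)
mtcars$am <- as.factor(mtcars$am)

# Fit on all rows purely for illustration (no train/test split here)
fit <- glm(am ~ mpg, data = mtcars, family = binomial)

# type = "response" returns the estimated probability that am is "1" for each row
probs <- predict(fit, newdata = mtcars, type = "response")

# Assumed cutoff of 0.5: probabilities above it become class "1", the rest "0"
predicted <- factor(ifelse(probs > 0.5, "1", "0"), levels = levels(mtcars$am))

# Show a few cars with their probability and predicted class
head(data.frame(mpg = mtcars$mpg, prob = round(probs, 3), predicted))
```

Keeping the probabilities around is useful when you want to try a cutoff other than 0.5 and see how the confusion matrix changes.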

Understanding the Confusion Matrix

A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known. The matrix compares the actual target values to the values predicted by the model.
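
The comparison described above can be sketched with base R's table function. The labels below are hypothetical values invented for illustration, not results from the lesson's model.

```r
# Hypothetical true and predicted labels, invented for illustration only
actual    <- factor(c("0", "0", "1", "1", "0", "1"), levels = c("0", "1"))
predicted <- factor(c("0", "1", "1", "1", "0", "0"), levels = c("0", "1"))

# Cross-tabulate: rows are predictions, columns are the true (reference) labels
cm <- table(Prediction = predicted, Reference = actual)
print(cm)

# The diagonal holds the correct predictions, so accuracy is diagonal / total
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)
```

This uses the same layout as the caret output: predictions in rows, reference labels in columns.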

Here's a breakdown of the confusion matrix components:

True Positives (TP): The number of correct positive predictions.
True Negatives (TN): The number of correct negative predictions.
False Positives (FP): The number of incorrect positive predictions (also known as Type I errors).
False Negatives (FN): The number of incorrect negative predictions (also known as Type II errors).
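
As a rough sketch, these four counts can be read off the matrix in the output above. Note that the output reports 0 as the 'Positive' class, so "positive" here means am = 0. The matrix is re-entered by hand below, and the variable names are illustrative.

```r
# Counts copied from the lesson's output: rows = Prediction, columns = Reference
cm <- matrix(c(4, 1, 0, 3), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

positive <- "0"  # the 'Positive' class reported in the output
negative <- "1"

TP <- cm[positive, positive]  # predicted positive, actually positive
FN <- cm[negative, positive]  # predicted negative, actually positive
FP <- cm[positive, negative]  # predicted positive, actually negative
TN <- cm[negative, negative]  # predicted negative, actually negative

print(c(TP = TP, TN = TN, FP = FP, FN = FN))
```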

The confusion matrix is the basis for metrics such as accuracy, precision, recall, and F1 score, which provide more insight into the performance of your model. In the caret output, precision appears as Pos Pred Value and recall as Sensitivity.
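
Here is a short sketch of how these metrics follow from the four counts, using the values from this lesson's output with 0 as the positive class (TP = 4, TN = 3, FP = 0, FN = 1). The variable names are illustrative.

```r
# Counts taken from the lesson's confusion matrix (positive class = 0)
TP <- 4; TN <- 3; FP <- 0; FN <- 1

accuracy  <- (TP + TN) / (TP + TN + FP + FN)  # share of all predictions that are correct
precision <- TP / (TP + FP)                   # of predicted positives, how many are right
recall    <- TP / (TP + FN)                   # of actual positives, how many were found
f1        <- 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(c(accuracy = accuracy, precision = precision,
              recall = recall, f1 = f1), 3))
```

Accuracy, precision, and recall come out to 0.875, 1.0, and 0.8, matching the Accuracy, Pos Pred Value, and Sensitivity lines in the output.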

Why It Matters

Making predictions and evaluating your model's performance are essential steps in the machine learning workflow. These steps help you understand how well your model generalizes to unseen data, which is crucial for making reliable decisions in real-world applications.

A confusion matrix, in particular, provides a detailed breakdown of your model's performance by showing the correct and incorrect predictions. This insight allows you to fine-tune your model and improve its accuracy, leading to better predictive performance.

Exciting, right? Let's dive into the practice section and see how well your logistic regression model performs on the test data!
