Welcome back! You've already taken a huge step by building and evaluating a logistic regression model in the previous lesson. Now, let's move forward and see how to make predictions using this model and evaluate its performance.
In this lesson, you will:

- Make predictions with your trained model using the test data.
- Evaluate the performance of the model using a confusion matrix.
- Understand what the evaluation results mean for your model.
By the end of this lesson, you will be able to:

- Use the `predict` function in R to generate predictions from your logistic regression model.
- Interpret a confusion matrix to understand the performance of your model.
Most of the code shown below should be familiar from previous units. The prediction step, added here, is the focus of this lesson:
```R
# Load the caret package (provides createDataPartition, preProcess,
# train, and confusionMatrix)
library(caret)

# Load the mtcars dataset
data(mtcars)

# Set seed for reproducibility
set.seed(123)

# Convert categorical columns to factors
mtcars$am <- as.factor(mtcars$am)
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)

# Splitting data into training and testing sets
trainIndex <- createDataPartition(mtcars$am, p = 0.7, list = FALSE, times = 1)
trainData <- mtcars[trainIndex, ]
testData <- mtcars[-trainIndex, ]

# Feature scaling (excluding factor columns)
numericColumns <- sapply(trainData, is.numeric)
preProcValues <- preProcess(trainData[, numericColumns], method = c("center", "scale"))
trainData[, numericColumns] <- predict(preProcValues, trainData[, numericColumns])
testData[, numericColumns] <- predict(preProcValues, testData[, numericColumns])

# Train a logistic regression model
model <- train(am ~ mpg, data = trainData, method = "glm", family = "binomial")

# Making predictions
predictions <- predict(model, testData)

# Evaluating the model
confusion <- confusionMatrix(predictions, testData$am)
print(confusion)
```
Output:
```
Confusion Matrix and Statistics

          Reference
Prediction 0 1
         0 4 0
         1 1 3

               Accuracy : 0.875
                 95% CI : (0.4735, 0.9968)
    No Information Rate : 0.625
    P-Value [Acc > NIR] : 0.135

                  Kappa : 0.75

 Mcnemar's Test P-Value : 1.000

            Sensitivity : 0.800
            Specificity : 1.000
         Pos Pred Value : 1.000
         Neg Pred Value : 0.750
             Prevalence : 0.625
         Detection Rate : 0.500
   Detection Prevalence : 0.500
      Balanced Accuracy : 0.900

       'Positive' Class : 0
```
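By default, `predict` on a caret classification model returns hard class labels, which is exactly what `confusionMatrix` expects. If you also want the underlying probabilities that the logistic regression produces, you can request them with `type = "prob"`. Here is a minimal sketch, assuming the `model` and `testData` objects created above:

```R
# Class probabilities instead of hard labels; returns a data frame with
# one column per class level ("0" and "1" for mtcars$am)
probabilities <- predict(model, testData, type = "prob")
head(probabilities)

# Hard labels correspond to the class with the higher probability
predictions <- predict(model, testData)
head(predictions)
```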
A confusion matrix is a table often used to describe the performance of a classification model on a set of test data for which the true values are known. The matrix compares the actual target values to the values predicted by the model.
Here's a breakdown of the confusion matrix components (see the sketch after this list for how they map to the output above):
- True Positives (TP): The number of correct positive predictions.
- True Negatives (TN): The number of correct negative predictions.
- False Positives (FP): The number of incorrect positive predictions (also known as Type I errors).
- False Negatives (FN): The number of incorrect negative predictions (also known as Type II errors).
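To make these counts concrete, here is a minimal sketch of how you could pull each one out of the `confusion` object created above. It assumes the positive class is "0", as reported in the output; rows of `confusion$table` are predictions and columns are the reference values:

```R
# confusion$table has predictions in rows and reference values in columns;
# the 'Positive' class reported above is "0"
tp <- confusion$table["0", "0"]  # predicted 0, actually 0 -> 4
fp <- confusion$table["0", "1"]  # predicted 0, actually 1 -> 0
fn <- confusion$table["1", "0"]  # predicted 1, actually 0 -> 1
tn <- confusion$table["1", "1"]  # predicted 1, actually 1 -> 3
```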
The confusion matrix helps in calculating metrics such as accuracy, precision, recall, and F1 score, which provide more insight into the performance of your model.
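As an illustration, the sketch below computes these metrics by hand from the four counts extracted above; the results should match the `Accuracy`, `Pos Pred Value`, and `Sensitivity` lines that `confusionMatrix` already reports.

```R
# Metrics computed from the counts above (tp = 4, tn = 3, fp = 0, fn = 1)
accuracy  <- (tp + tn) / (tp + tn + fp + fn)  # 7/8 = 0.875
precision <- tp / (tp + fp)                   # 4/4 = 1.000 (Pos Pred Value)
recall    <- tp / (tp + fn)                   # 4/5 = 0.800 (Sensitivity)
f1        <- 2 * precision * recall / (precision + recall)  # ~0.889

cat("Accuracy:", accuracy, "Precision:", precision,
    "Recall:", recall, "F1:", f1, "\n")
```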
Making predictions and evaluating your model's performance are essential steps in the machine learning workflow. These steps help you understand how well your model generalizes to unseen data, which is crucial for making reliable decisions in real-world applications.
A confusion matrix, in particular, provides a detailed breakdown of your model's performance by showing the correct and incorrect predictions. This insight allows you to fine-tune your model and improve its accuracy, leading to better predictive performance.
Exciting, right? Let's dive into the practice section and see how well your logistic regression model performs on the test data!