Welcome back! You're now ready to build and evaluate machine learning models. You have learned how to preprocess the mtcars dataset and how to split the data into training and testing sets. Now, let's take it a step further and construct a logistic regression model.
In this lesson, you will:
- Train a logistic regression model using the mtcars dataset.
- Understand the importance of logistic regression in binary classification tasks.
- Display and interpret model details to evaluate model performance.
- Interpret warnings generated during model training and understand their implications.
By the end of this lesson, you will be able to:
- Build a logistic regression model using the `caret` library in R.
- Print and interpret the details of the model, including key performance metrics.
- Explain common warnings that may arise during model training and their significance.
Here's a key snippet of the code you'll be working with:
```r
# Load the caret library and the mtcars dataset
library(caret)
data(mtcars)

# Set seed for reproducibility
set.seed(123)

# Convert categorical columns to factors
mtcars$am <- as.factor(mtcars$am)
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)

# Splitting data into training and testing sets
trainIndex <- createDataPartition(mtcars$am, p = 0.7, list = FALSE, times = 1)
trainData <- mtcars[trainIndex, ]
testData <- mtcars[-trainIndex, ]

# Feature scaling (excluding factor columns)
numericColumns <- sapply(trainData, is.numeric)
preProcValues <- preProcess(trainData[, numericColumns], method = c("center", "scale"))
trainData[, numericColumns] <- predict(preProcValues, trainData[, numericColumns])
testData[, numericColumns] <- predict(preProcValues, testData[, numericColumns])

# Train a logistic regression model, and display warnings
withCallingHandlers({
  model <- train(am ~ mpg + hp + wt, data = trainData, method = "glm", family = "binomial")
}, warning = function(w) {
  message("Warning: ", conditionMessage(w))
  invokeRestart("muffleWarning")
})

# Display the model details
print(model)
```
Let's understand the `train` function parameters in more depth:

- `am ~ mpg + hp + wt`: This formula specifies that we are trying to predict the `am` (transmission) column using `mpg` (miles per gallon), `hp` (horsepower), and `wt` (weight) as predictors.
- `data = trainData`: This specifies the dataset to be used for training the model.
- `method = "glm"`: This indicates that we are using generalized linear models for training.
- `family = "binomial"`: This specifies the family of the model, which in this case is binomial logistic regression since `am` is a binary outcome.
In the above code, we use `withCallingHandlers` to train the model and handle any warnings that might occur during the training process. The `withCallingHandlers` function allows us to catch warnings and handle them in a specific way, while still allowing the code to run. In this case, we are re-emitting each warning as a message and using `invokeRestart("muffleWarning")` to suppress the original warning.
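To see this handler pattern in isolation, here is a minimal sketch independent of caret. The `noisy_sqrt` function is a made-up example that raises a warning; the handler converts it into a message and muffles it, while the computation still completes:

```r
# A hypothetical function that warns on negative input but still returns a value
noisy_sqrt <- function(x) {
  if (x < 0) warning("negative input; result will be NaN")
  sqrt(x)
}

# Re-emit any warning as a message, then muffle the original warning
result <- withCallingHandlers(
  noisy_sqrt(-4),
  warning = function(w) {
    message("Warning: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(result)  # NaN; the computation completed despite the warnings
```

The same structure wraps the `train` call in the lesson code: the handler runs while the main expression is still evaluating, which is what distinguishes `withCallingHandlers` from `tryCatch` (which would abort the expression).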
The output when displaying the model details is as follows:
```
Generalized Linear Model 

24 samples
 3 predictor
 2 classes: '0', '1' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 24, 24, 24, 24, 24, 24, ... 
Resampling results:

  Accuracy   Kappa    
  0.7824604  0.5547113
```
Note that the evaluation was performed on the training set using bootstrapped resampling. Bootstrapping creates multiple training sets by randomly sampling the original data with replacement; fitting and evaluating the model on these variations yields a more robust estimate of performance than a single fit would.
To understand the output, let's review the following performance metrics:
- Accuracy: This measures the proportion of correct predictions made by the model out of all predictions. For example, an accuracy of 0.78 means the model correctly predicted 78% of the cases.
- Kappa: This adjusts the accuracy to account for the possibility of the agreement occurring by chance. A Kappa value of 1 indicates perfect agreement, while 0 means the agreement is no better than random guessing.
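To make these two metrics concrete, here is a small sketch that computes accuracy and Cohen's Kappa by hand from a hypothetical 2x2 confusion matrix (the counts are invented for illustration, not taken from the lesson's model):

```r
# Hypothetical confusion matrix (rows = predicted, columns = actual):
#            actual 0   actual 1
# pred 0        10          2
# pred 1         3          9
a <- 10; b <- 2; c <- 3; d <- 9
n <- a + b + c + d                       # 24 observations

# Accuracy: proportion of correct predictions (the diagonal)
accuracy <- (a + d) / n

# Expected chance agreement: for each class, multiply the marginal
# proportion of predictions by the marginal proportion of actuals
pe <- ((a + b) / n) * ((a + c) / n) +    # chance agreement on class 0
      ((c + d) / n) * ((b + d) / n)      # chance agreement on class 1

# Kappa: accuracy corrected for chance agreement
kappa <- (accuracy - pe) / (1 - pe)

print(round(c(accuracy = accuracy, kappa = kappa), 4))
```

Kappa is always lower than raw accuracy whenever chance agreement is above zero, which is why the lesson's model shows 0.78 accuracy but only 0.55 Kappa.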
When training the model, you may encounter the following warnings:
```
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
```
The first warning, "algorithm did not converge", indicates that the iterative process used to estimate the model parameters did not successfully find a solution. This can happen due to various reasons such as multicollinearity (when predictor variables are highly correlated with each other), insufficient iterations, or extreme data imbalances.
The second warning, "fitted probabilities numerically 0 or 1 occurred", indicates that the model predicted probabilities extremely close to 0 or 1 for some data points, signaling near-perfect separation of the binary outcome based on the predictors. This can occur for various reasons, including potential overfitting, especially with a small dataset like mtcars. Overfitting means the model is memorizing training data rather than learning general patterns. Small sample sizes increase the risk of overfitting and misleadingly optimistic performance metrics. In such cases, you might consider regularization techniques or gathering more data to mitigate these issues, but for now, let's move on and wrap up.
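To see how separation triggers these warnings, here is a tiny constructed example (toy data, not from mtcars) in which the predictor perfectly separates the two classes. Fitting a plain `glm` on it produces the same "fitted probabilities numerically 0 or 1 occurred" warning:

```r
# Toy data: x perfectly separates y (y = 0 for negative x, y = 1 for positive x)
x <- c(-3, -2, -1, 1, 2, 3)
y <- c(0, 0, 0, 1, 1, 1)

# Fit a logistic regression; surface warnings as messages, as in the lesson
fit <- withCallingHandlers(
  glm(y ~ x, family = "binomial"),
  warning = function(w) {
    message("Warning: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

# With perfect separation, the fitted probabilities are pushed to the extremes
print(round(fitted(fit), 3))
```

The estimated coefficient for `x` grows without bound here, because any increase in its magnitude improves the likelihood; the optimizer simply stops at its iteration limit.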
Building and evaluating models is a core part of any machine learning project. Logistic regression, in particular, is a powerful and widely used method for binary classification tasks, such as determining whether a car has an automatic or manual transmission in the mtcars dataset. Mastering this technique will enable you to tackle various real-world problems where classification is essential.
Evaluating your model is equally important as it helps you understand its performance and potential weaknesses. The insights gained from this evaluation will guide you in refining your model and making it more robust.
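As a preview of that evaluation step, here is a condensed sketch of the full pipeline that scores the held-out test set with `predict` and summarizes the results with caret's `confusionMatrix` (it assumes caret is installed; the preprocessing is trimmed to the essentials):

```r
library(caret)
data(mtcars)
set.seed(123)

# Convert the outcome to a factor and split the data
mtcars$am <- as.factor(mtcars$am)
trainIndex <- createDataPartition(mtcars$am, p = 0.7, list = FALSE)
trainData <- mtcars[trainIndex, ]
testData  <- mtcars[-trainIndex, ]

# Train the same logistic regression model (warnings suppressed for brevity)
model <- suppressWarnings(
  train(am ~ mpg + hp + wt, data = trainData, method = "glm", family = "binomial")
)

# Predict classes for the unseen test set and compare with the true labels
preds <- predict(model, newdata = testData)
print(confusionMatrix(preds, testData$am))
```

Because the test set was never seen during training, this estimate is a better guide to real-world performance than the bootstrapped training-set metrics shown earlier.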
Ready to see it in action? Let's get started with the practice section and build our logistic regression model!