Welcome back! You're now ready to build and evaluate machine learning models. You have learned how to preprocess the mtcars dataset and how to split the data into training and testing sets. Now, let's take it a step further and construct a logistic regression model.
By the end of this lesson, you will be able to build a logistic regression model and interpret its resampled performance metrics using the caret library in R. Here's a key snippet of the code you'll be working with:
```r
# Load the caret library and the mtcars dataset
library(caret)
data(mtcars)

# Set seed for reproducibility
set.seed(123)

# Convert categorical columns to factors
mtcars$am <- as.factor(mtcars$am)
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)

# Splitting data into training and testing sets
trainIndex <- createDataPartition(mtcars$am, p = 0.7, list = FALSE, times = 1)
trainData <- mtcars[trainIndex, ]
testData <- mtcars[-trainIndex, ]

# Feature scaling (excluding factor columns)
numericColumns <- sapply(trainData, is.numeric)
preProcValues <- preProcess(trainData[, numericColumns], method = c("center", "scale"))
trainData[, numericColumns] <- predict(preProcValues, trainData[, numericColumns])
testData[, numericColumns] <- predict(preProcValues, testData[, numericColumns])

# Train a logistic regression model, and display warnings
withCallingHandlers({
  model <- train(am ~ mpg + hp + wt, data = trainData, method = "glm", family = "binomial")
}, warning = function(w) {
  message("Warning: ", conditionMessage(w))
  invokeRestart("muffleWarning")
})

# Display the model details
print(model)
```
Let's understand the `train` function parameters in more depth:

- `am ~ mpg + hp + wt`: This formula specifies that we are trying to predict the `am` (transmission) column using `mpg` (miles per gallon), `hp` (horsepower), and `wt` (weight) as predictors.
- `data = trainData`: This specifies the dataset to be used for training the model.
- `method = "glm"`: This indicates that we are using a generalized linear model for training.
- `family = "binomial"`: This specifies the model family, which in this case is binomial logistic regression, since `am` is a binary outcome.

In the above code, we use `withCallingHandlers` to train the model and handle any warnings that occur during training. The `withCallingHandlers` function allows us to catch warnings and handle them in a specific way while still allowing the code to run. Here, we re-emit each warning as a message and call `invokeRestart("muffleWarning")` to suppress the default warning output.
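To see the handler mechanics in isolation, here is a minimal, self-contained sketch; the `log(-1)` call is used only to provoke a warning:

```r
# log(-1) raises a "NaNs produced" warning; the handler logs it as a
# message, then muffles it so the surrounding code keeps running.
withCallingHandlers({
  result <- log(-1)
  print(result)  # still executes and prints NaN
}, warning = function(w) {
  message("Caught warning: ", conditionMessage(w))
  invokeRestart("muffleWarning")
})
```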
The output when displaying the model details is as follows:
```
Generalized Linear Model

24 samples
 3 predictor
 2 classes: '0', '1'

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 24, 24, 24, 24, 24, 24, ...
Resampling results:

  Accuracy   Kappa
  0.7824604  0.5547113
```
Note that the evaluation was performed on the training set using bootstrapped resampling. This technique creates multiple training sets by randomly sampling the original data with replacement and trains the model on each one, which provides a more robust estimate of model performance than a single fit.
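If you want to make this resampling scheme explicit, or swap in a different one, you can pass a `trainControl` object to `train`. A minimal sketch, reusing `trainData` from the snippet above:

```r
# caret bootstraps 25 times by default; trainControl makes this explicit
# and lets you substitute another scheme (e.g., cross-validation).
ctrl <- trainControl(method = "boot", number = 25)
model <- train(am ~ mpg + hp + wt, data = trainData,
               method = "glm", family = "binomial", trControl = ctrl)
```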
To understand the output, let's review the two performance metrics it reports:

- Accuracy: the proportion of observations the model classified correctly across the bootstrap resamples.
- Kappa: classification agreement corrected for the agreement expected by chance, which makes it more informative than raw accuracy when classes are imbalanced.
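These values are estimated on resampled training data. To evaluate on the held-out test set as well, here is a minimal sketch assuming the `model` and `testData` objects created in the snippet above (`confusionMatrix` is part of caret):

```r
# Predict transmission classes for the test set, then compare against
# the true labels; confusionMatrix reports Accuracy and Kappa alongside
# per-class statistics.
predictions <- predict(model, newdata = testData)
confusionMatrix(predictions, testData$am)
```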
When training the model, you may encounter the following warnings:
```
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
```
The first warning, "algorithm did not converge", indicates that the iterative process used to estimate the model parameters did not successfully find a solution. This can happen for several reasons, such as multicollinearity (when predictor variables are highly correlated with each other), too few iterations, or extreme class imbalance.
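One quick way to check the multicollinearity possibility is to inspect pairwise correlations among the predictors. A minimal sketch using base R's `cor` and caret's `findCorrelation`; the 0.8 cutoff is an arbitrary choice for illustration:

```r
# Compute pairwise correlations among the raw predictor columns;
# findCorrelation flags columns whose correlations exceed the cutoff.
predCor <- cor(mtcars[, c("mpg", "hp", "wt")])
print(predCor)
findCorrelation(predCor, cutoff = 0.8, names = TRUE)
```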
The second warning, "fitted probabilities numerically 0 or 1 occurred", indicates that the model predicted probabilities extremely close to 0 or 1 for some data points, signaling near-perfect separation of the binary outcome based on the predictors. This can occur for several reasons, including overfitting, which is especially likely with a small dataset like mtcars: the model memorizes the training data rather than learning general patterns, and small sample sizes also produce misleadingly optimistic performance metrics. In such cases, you might consider regularization techniques or gathering more data to mitigate these issues; a brief sketch of the regularized approach follows below.
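As a sketch of the regularization option (not part of the lesson's main code path), caret can fit penalized logistic regression through the glmnet package, assuming it is installed:

```r
# method = "glmnet" fits penalized logistic regression; caret infers the
# binomial family from the factor outcome and tunes the penalty strength
# (lambda) and mixing parameter (alpha) via resampling.
regModel <- train(am ~ mpg + hp + wt, data = trainData, method = "glmnet")
print(regModel)
```

With that, let's move on and wrap up.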
Building and evaluating models is a core part of any machine learning project. Logistic regression, in particular, is a powerful and widely used method for binary classification tasks, such as determining whether a car has an automatic or manual transmission in the mtcars dataset. Mastering this technique will enable you to tackle various real-world problems where classification is essential.
Evaluating your model is equally important as it helps you understand its performance and potential weaknesses. The insights gained from this evaluation will guide you in refining your model and making it more robust.
Ready to see it in action? Let's get started with the practice section and build our logistic regression model!