Lesson 5
Visualizing Model Results and Feature Importance
What You'll Learn

Great to have you back! In the previous lesson, you built and evaluated a logistic regression model. Now, it's time to take another exciting step. This lesson focuses on visualizing the results of your model and understanding the importance of different features.

In this lesson, you will:

  1. Visualize the logistic regression coefficients.
  2. Identify which features are most important in your model.
  3. Learn to create informative plots using the ggplot2 package in R.

By the end of this lesson, you'll be able to create visualizations that highlight the significant variables in your model and gain insights from it.

Why It Matters

Visualizing model results is crucial for multiple reasons:

  1. Interpretability: Visualization helps you and others understand how the model makes predictions. You can see which features have the most influence, making your model more transparent.
  2. Communication: Clear visual representations make it easier to present your findings to non-technical stakeholders. This is often key to gaining buy-in and moving projects forward.
  3. Model Improvement: By understanding feature importance, you can make more informed decisions about which features to focus on or remove, leading to better model performance.

These skills are essential for any data scientist aiming to make real-world impacts with their models.

Example Code to Get You Started

You should be familiar with most of the code shown below from previous units. The visualization step, added here, is the focus of this lesson:

```r
# Load required packages
library(caret)
library(ggplot2)

# Load the mtcars dataset
data(mtcars)

# Set seed for reproducibility
set.seed(123)

# Convert categorical columns to factors
mtcars$am <- as.factor(mtcars$am)
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)

# Split data into training and testing sets
trainIndex <- createDataPartition(mtcars$am, p = 0.7, list = FALSE, times = 1)
trainData <- mtcars[trainIndex, ]
testData <- mtcars[-trainIndex, ]

# Feature scaling (excluding factor columns)
numericColumns <- sapply(trainData, is.numeric)
preProcValues <- preProcess(trainData[, numericColumns], method = c("center", "scale"))
trainData[, numericColumns] <- predict(preProcValues, trainData[, numericColumns])
testData[, numericColumns] <- predict(preProcValues, testData[, numericColumns])

# Train a logistic regression model, and display warnings
withCallingHandlers({
  model <- train(am ~ ., data = trainData, method = "glm", family = "binomial")
}, warning = function(w) {
  message("Warning: ", conditionMessage(w))
  invokeRestart("muffleWarning")
})

# Visualize the logistic regression coefficients
coef_df <- as.data.frame(coef(summary(model$finalModel)))
coef_df$Variable <- rownames(coef_df)
names(coef_df)[1] <- "Estimate"

scatter_plot <- ggplot(coef_df, aes(x = reorder(Variable, Estimate), y = Estimate)) +
  geom_bar(stat = 'identity') +
  coord_flip() +
  theme_light() +
  labs(title = "Variable Importance (Logistic Regression)",
       x = "Variables",
       y = "Estimate")

# Display the plot
print(scatter_plot)
```

Output: a horizontal bar chart showing the estimate for each model coefficient.

Note that when running the code you might notice a new warning: "prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases." This occurs because the formula am ~ . includes every predictor in the dataset. The mtcars dataset is very small, which can lead to collinearity or insufficient variability among the predictors. These issues can produce a rank-deficient fit, meaning some coefficients cannot be estimated reliably. If you use fewer predictors, such as am ~ mpg + hp + wt, the warning disappears.
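If you'd like to verify this yourself, here is a minimal sketch of fitting the reduced model. For brevity it skips the train/test split and scaling from the full example and fits on the whole dataset, so treat it as an illustration rather than a replacement for the pipeline above:

```r
library(caret)

# Prepare the data: only the outcome needs to be a factor here
data(mtcars)
mtcars$am <- as.factor(mtcars$am)
set.seed(123)

# With only three predictors, the design matrix is full rank,
# so the rank-deficiency warning no longer appears
model_small <- train(am ~ mpg + hp + wt, data = mtcars,
                     method = "glm", family = "binomial")
summary(model_small$finalModel)
```

Comparing this summary with the full model's is a quick way to see how dropping predictors stabilizes the coefficient estimates.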

Here is a line-by-line explanation of the visualization code:

  • coef_df <- as.data.frame(coef(summary(model$finalModel))) converts the logistic regression coefficients into a data frame.
  • coef_df$Variable <- rownames(coef_df) adds the coefficient names to a Variable column.
  • names(coef_df)[1] <- "Estimate" renames the first column to Estimate.
  • scatter_plot <- ggplot(coef_df, aes(x = reorder(Variable, Estimate), y = Estimate)) initializes a ggplot with Variable on the x-axis (reordered by Estimate) and Estimate on the y-axis.
  • geom_bar(stat = 'identity') adds bars representing the Estimate values.
  • coord_flip() flips the axes to make the bars horizontal.
  • theme_light() applies a light theme for better visual clarity.
  • labs(...) sets the title and axis labels.
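Because the numeric predictors were centered and scaled, coefficient magnitudes are roughly comparable, so the absolute value of each estimate can serve as a simple importance score. The sketch below shows this on a hypothetical coefficient table with the same shape as coef_df (the values here are made up for illustration):

```r
library(ggplot2)

# Hypothetical coefficient table, same shape as coef_df above
coef_df <- data.frame(
  Estimate = c(0.4, -1.8, 2.6, -0.2),
  Variable = c("(Intercept)", "mpg", "wt", "hp")
)

# Drop the intercept and rank predictors by absolute effect size;
# on scaled predictors, larger |Estimate| means more influence
ranked <- coef_df[coef_df$Variable != "(Intercept)", ]
ranked <- ranked[order(-abs(ranked$Estimate)), ]
print(ranked$Variable)  # most influential predictor first

# Plot |Estimate| instead of the signed value to emphasize magnitude
importance_plot <- ggplot(ranked,
    aes(x = reorder(Variable, abs(Estimate)), y = abs(Estimate))) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_light() +
  labs(title = "Feature Importance by |Coefficient|",
       x = "Variables", y = "|Estimate|")
```

Note that signs still matter for interpretation (a negative coefficient lowers the predicted probability), so the signed plot from the main example and this magnitude ranking complement each other.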

Excited to dive in? Let's move on to the practice section and bring your model results to life through visualization!

Enjoy this lesson? Now it's time to practice with Cosmo!
Practice is how you turn knowledge into actual skills.