Welcome to the next step in our journey with the mtcars dataset. In the previous lesson, you learned how to preprocess and explore the mtcars dataset, laying the groundwork for more complex analyses. Now, we'll progress to splitting the data into training and test sets and scaling our features. These steps are crucial in preparing your data for machine learning models.
First, let's start by loading the mtcars dataset. This dataset is included with R, so you don’t need to download anything extra.
```r
# Load the mtcars dataset
data(mtcars)

# Print the first few rows to ensure it's loaded correctly
print(head(mtcars))
```
Output:
```
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```
Next, we set a random seed. Setting a seed ensures that your results can be reproduced by others, which is especially important for any step that involves randomness, such as the train/test split we'll perform shortly.
```r
# Set seed for reproducibility
set.seed(123)
```
This code doesn’t produce visible output but is crucial for reproducibility.
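To see the effect, here's a quick sketch: repeating a random draw after resetting the same seed produces identical results. (The exact numbers drawn depend on your R version's random number generator, so treat the output as illustrative.)

```r
# Same seed, same draw: both samples are identical
set.seed(123)
sample(1:10, 3)

set.seed(123)
sample(1:10, 3)   # matches the first draw exactly
```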
In this step, we will convert the categorical columns in the mtcars dataset to factors. This is important because factors are treated as categorical data in R, enabling more accurate analyses and model training. Specifically, we'll convert the columns `am`, `cyl`, `vs`, `gear`, and `carb` to factors.
```r
# Inspect the structure before conversion
print("Structure of mtcars dataset:")
str(mtcars)

# Convert categorical columns to factors
mtcars$am   <- as.factor(mtcars$am)
mtcars$cyl  <- as.factor(mtcars$cyl)
mtcars$vs   <- as.factor(mtcars$vs)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)

# Verify the structure after conversion
print("Verify the structure of mtcars after conversion:")
str(mtcars)
```

Note that `str()` prints its summary as a side effect, so we call it directly rather than wrapping it in `print()` (which would also print the `NULL` that `str()` returns invisibly).
Output:
1[1] "Structure of mtcars dataset:" 2'data.frame': 32 obs. of 11 variables: 3 $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ... 4 $ cyl : num 6 6 4 6 8 6 8 4 4 6 ... 5 $ disp: num 160 160 108 258 360 ... 6 $ hp : num 110 110 93 110 175 105 245 62 95 123 ... 7 $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ... 8 $ wt : num 2.62 2.88 2.32 3.21 3.44 ... 9 $ qsec: num 16.5 17 18.6 19.4 17 ... 10 $ vs : num 0 0 1 1 0 1 0 1 1 1 ... 11 $ am : num 1 1 1 0 0 0 0 0 0 0 ... 12 $ gear: num 4 4 4 3 3 3 3 4 4 4 ... 13 $ carb: num 4 4 1 1 2 1 4 2 2 4 ... 14NULL 15[1] "Verify the structure of mtcars after conversion:" 16'data.frame': 32 obs. of 11 variables: 17 $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ... 18 $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ... 19 $ disp: num 160 160 108 258 360 ... 20 $ hp : num 110 110 93 110 175 105 245 62 95 123 ... 21 $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ... 22 $ wt : num 2.62 2.88 2.32 3.21 3.44 ... 23 $ qsec: num 16.5 17 18.6 19.4 17 ... 24 $ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ... 25 $ am : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ... 26 $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ... 27 $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ... 28NULL
Now we'll use the `caret` library to split the mtcars dataset into training and testing sets. The `createDataPartition` function from `caret` helps us achieve this. We'll partition 70% of the data for training and the remaining 30% for testing.
```r
# Load caret for data partitioning and preprocessing
library(caret)

# Split the data into training and testing sets, stratified by 'am'
trainIndex <- createDataPartition(mtcars$am, p = 0.7, list = FALSE, times = 1)
trainData <- mtcars[trainIndex, ]
testData <- mtcars[-trainIndex, ]

# Print the number of rows in training and testing sets
print(nrow(trainData))
print(nrow(testData))
```
Output:
```
[1] 24  # Number of rows in trainData
[1] 8   # Number of rows in testData
```
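Because we passed `mtcars$am` to `createDataPartition`, the sampling is stratified: the proportion of automatic and manual cars in the training set mirrors the full dataset as closely as a 70% split of a small dataset allows. A quick check:

```r
# Compare the class balance of 'am' in the full data and the training split
prop.table(table(mtcars$am))
prop.table(table(trainData$am))
```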
Feature scaling is an important step to ensure that all data points are on a similar scale. This is especially important for algorithms that use distance measurements (e.g., K-Nearest Neighbors) or gradient descent optimization.
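To make this concrete, here is a small sketch (independent of the split above, using the raw mtcars columns) showing how a feature measured in large units can dominate a Euclidean distance until the features are scaled:

```r
# On raw units, disp (roughly 70-470) swamps wt (roughly 1.5-5.5),
# so the distance is driven almost entirely by displacement
cars <- c("Mazda RX4", "Hornet Sportabout")
dist(mtcars[cars, c("disp", "wt")])

# After scaling, both features contribute comparably
dist(scale(mtcars[, c("disp", "wt")])[cars, ])
```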
We'll standardize the features, centering each to mean 0 and scaling it to unit standard deviation, using the `preProcess` function from the `caret` library.
```r
# Feature scaling (excluding factor columns)
print("Train data before feature scaling")
print(head(trainData))

# Identify numeric columns and learn centering/scaling parameters from the training set
numericColumns <- sapply(trainData, is.numeric)
preProcValues <- preProcess(trainData[, numericColumns], method = c("center", "scale"))

# Apply the same transformation to both the training and test sets
trainData[, numericColumns] <- predict(preProcValues, trainData[, numericColumns])
testData[, numericColumns] <- predict(preProcValues, testData[, numericColumns])

print("Train data after feature scaling")
print(head(trainData))
```
Here's what each piece does:

- `sapply(trainData, is.numeric)` identifies the numeric columns in `trainData`, so the factor columns are left untouched.
- `preProcess(trainData[, numericColumns], method = c("center", "scale"))` computes the scaling parameters (each column's mean and standard deviation) from the training data only.
- `predict(preProcValues, ...)` applies those parameters to a dataset. Reusing the training set's parameters on `testData` keeps information about the test set from leaking into preprocessing (a manual equivalent is sketched below).
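Under the hood, centering and scaling is just subtracting the training-set mean and dividing by the training-set standard deviation, column by column. Here is a minimal manual equivalent, shown for intuition only (run it in place of, not after, the `preProcess` step above, since `trainData` has already been transformed at that point):

```r
# Learn the parameters from the (unscaled) training data only...
trainMeans <- sapply(trainData[, numericColumns], mean)
trainSds   <- sapply(trainData[, numericColumns], sd)

# ...then apply those same parameters to any dataset
scaledTest <- sweep(testData[, numericColumns], 2, trainMeans, "-")
scaledTest <- sweep(scaledTest, 2, trainSds, "/")
```

In practice, `preProcess` is preferable because it stores the parameters in a single object that can be reapplied to any new data.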
Output:
1[1] "Train data before feature scaling" 2 mpg cyl disp hp drat wt qsec vs am gear carb 3Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 4Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 5Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 6Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 7Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 8Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4 9[1] "Train data after feature scaling" 10 mpg cyl disp hp drat wt 11Mazda RX4 0.2252332 6 -0.6421320 -0.6395978 0.6264782 -0.6470675 12Mazda RX4 Wag 0.2252332 6 -0.6421320 -0.6395978 0.6264782 -0.3787348 13Datsun 710 0.5380973 4 -1.0652699 -0.8746931 0.5298492 -0.9627529 14Hornet Sportabout -0.1745376 8 0.9853212 0.2592964 -0.8229572 0.2158061 15Valiant -0.2788257 6 -0.1132097 -0.7087435 -1.5766637 0.2368518 16Duster 360 -0.9393166 8 0.9853212 1.2273362 -0.7070024 0.3526031 17 qsec vs am gear carb 18Mazda RX4 -0.6309623 0 1 4 4 19Mazda RX4 Wag -0.3456281 0 1 4 4 20Datsun 710 0.4645173 1 1 4 1 21Hornet Sportabout -0.3456281 0 0 3 2 22Valiant 1.2848532 1 0 3 1 23Duster 360 -0.9468681 0 0 3 4
Splitting your dataset and scaling features are crucial steps in building effective machine learning models. By splitting the data, you ensure that your model is trained and tested on different data, which helps in evaluating its real-world performance. Feature scaling brings all features to a similar scale, which is especially important for algorithms that rely on distances (like K-Nearest Neighbors) or gradients (like gradient descent).
Mastering these techniques will significantly improve the accuracy and reliability of your models. These steps may seem straightforward, but they form the backbone of any robust machine learning project.
Are you ready to take the next step? Let's get started with the practice section and put these concepts into action.