Welcome to the next step in our journey with the mtcars dataset. In the previous lesson, you learned how to preprocess and explore the mtcars dataset, laying the groundwork for more complex analyses. Now, we'll progress to splitting the data into training and test sets and scaling our features. These steps are crucial in preparing your data for machine learning models.
First, let's start by loading the mtcars dataset. This dataset is included with R, so you don’t need to download anything extra.
```r
# Load the mtcars dataset
data(mtcars)

# Print the first few rows to ensure it's loaded correctly
print(head(mtcars))
```
Output:
```
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```
Next, we set a random seed. Setting a seed ensures that your results can be reproduced by others, which is especially important for any step that involves randomness, such as the train/test split we'll perform shortly.
```r
# Set seed for reproducibility
set.seed(123)
```
This code doesn’t produce visible output but is crucial for reproducibility.
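To see the effect, here's a quick sketch: repeating a random draw after resetting the same seed produces identical results. (The exact numbers drawn depend on your R version's random number generator, so treat the output as illustrative.)

```r
# Same seed, same draw: both samples are identical
set.seed(123)
sample(1:10, 3)

set.seed(123)
sample(1:10, 3)   # matches the first draw exactly
```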
In this step, we will convert the categorical columns in the mtcars dataset to factors. This is important because factors are treated as categorical data in R, enabling more accurate analyses and model training. Specifically, we'll convert the columns `am`, `cyl`, `vs`, `gear`, and `carb` to factors.
```r
# Inspect the structure before conversion
print("Structure of mtcars dataset:")
str(mtcars)

# Convert categorical columns to factors
mtcars$am   <- as.factor(mtcars$am)
mtcars$cyl  <- as.factor(mtcars$cyl)
mtcars$vs   <- as.factor(mtcars$vs)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)

# Verify the structure after conversion
print("Verify the structure of mtcars after conversion:")
str(mtcars)
```

Note that `str()` prints its summary as a side effect, so we call it directly rather than wrapping it in `print()` (which would also print the `NULL` that `str()` returns invisibly).
Output:
1[1] "Structure of mtcars dataset:" 2'data.frame': 32 obs. of 11 variables: 3 $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ... 4 $ cyl : num 6 6 4 6 8 6 8 4 4 6 ... 5 $ disp: num 160 160 108 258 360 ... 6 $ hp : num 110 110 93 110 175 105 245 62 95 123 ... 7 $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ... 8 $ wt : num 2.62 2.88 2.32 3.21 3.44 ... 9 $ qsec: num 16.5 17 18.6 19.4 17 ... 10 $ vs : num 0 0 1 1 0 1 0 1 1 1 ... 11 $ am : num 1 1 1 0 0 0 0 0 0 0 ... 12 $ gear: num 4 4 4 3 3 3 3 4 4 4 ... 13 $ carb: num 4 4 1 1 2 1 4 2 2 4 ... 14NULL 15[1] "Verify the structure of mtcars after conversion:" 16'data.frame': 32 obs. of 11 variables: 17 $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ... 18 $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ... 19 $ disp: num 160 160 108 258 360 ... 20 $ hp : num 110 110 93 110 175 105 245 62 95 123 ... 21 $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ... 22 $ wt : num 2.62 2.88 2.32 3.21 3.44 ... 23 $ qsec: num 16.5 17 18.6 19.4 17 ... 24 $ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ... 25 $ am : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ... 26 $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ... 27 $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ... 28NULL
Now we'll use the `caret` library to split the mtcars dataset into training and testing sets. The `createDataPartition` function from `caret` helps us achieve this. We'll partition 70% of the data for training and the remaining 30% for testing.
```r
# Load caret for data partitioning and preprocessing
library(caret)

# Split the data into training and testing sets, stratified by 'am'
trainIndex <- createDataPartition(mtcars$am, p = 0.7, list = FALSE, times = 1)
trainData <- mtcars[trainIndex, ]
testData <- mtcars[-trainIndex, ]

# Print the number of rows in training and testing sets
print(nrow(trainData))
print(nrow(testData))
```
Output:
```
[1] 24  # Number of rows in trainData
[1] 8   # Number of rows in testData
```
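Because we passed `mtcars$am` to `createDataPartition`, the sampling is stratified: the proportion of automatic and manual cars in the training set mirrors the full dataset as closely as a 70% split of a small dataset allows. A quick check:

```r
# Compare the class balance of 'am' in the full data and the training split
prop.table(table(mtcars$am))
prop.table(table(trainData$am))
```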
Feature scaling is an important step to ensure that all data points are on a similar scale. This is especially important for algorithms that use distance measurements (e.g., K-Nearest Neighbors) or gradient descent optimization.
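To make this concrete, here is a small sketch (independent of the split above, using the raw mtcars columns) showing how a feature measured in large units can dominate a Euclidean distance until the features are scaled:

```r
# On raw units, disp (roughly 70-470) swamps wt (roughly 1.5-5.5),
# so the distance is driven almost entirely by displacement
cars <- c("Mazda RX4", "Hornet Sportabout")
dist(mtcars[cars, c("disp", "wt")])

# After scaling, both features contribute comparably
dist(scale(mtcars[, c("disp", "wt")])[cars, ])
```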
We'll standardize the features, centering each to mean 0 and scaling it to unit standard deviation, using the `preProcess` function from the `caret` library.
```r
# Feature scaling (excluding factor columns)
print("Train data before feature scaling")
print(head(trainData))

# Identify numeric columns and learn centering/scaling parameters from the training set
numericColumns <- sapply(trainData, is.numeric)
preProcValues <- preProcess(trainData[, numericColumns], method = c("center", "scale"))

# Apply the same transformation to both the training and test sets
trainData[, numericColumns] <- predict(preProcValues, trainData[, numericColumns])
testData[, numericColumns] <- predict(preProcValues, testData[, numericColumns])

print("Train data after feature scaling")
print(head(trainData))
```
Here's what each piece does:

- `sapply(trainData, is.numeric)` identifies the numeric columns in `trainData`, so the factor columns are left untouched.
- `preProcess(trainData[, numericColumns], method = c("center", "scale"))` computes the scaling parameters (each column's mean and standard deviation) from the training data only.
- `predict(preProcValues, ...)` applies those parameters to a dataset. Reusing the training set's parameters on `testData` keeps information about the test set from leaking into preprocessing (a manual equivalent is sketched below).
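Under the hood, centering and scaling is just subtracting the training-set mean and dividing by the training-set standard deviation, column by column. Here is a minimal manual equivalent, shown for intuition only (run it in place of, not after, the `preProcess` step above, since `trainData` has already been transformed at that point):

```r
# Learn the parameters from the (unscaled) training data only...
trainMeans <- sapply(trainData[, numericColumns], mean)
trainSds   <- sapply(trainData[, numericColumns], sd)

# ...then apply those same parameters to any dataset
scaledTest <- sweep(testData[, numericColumns], 2, trainMeans, "-")
scaledTest <- sweep(scaledTest, 2, trainSds, "/")
```

In practice, `preProcess` is preferable because it stores the parameters in a single object that can be reapplied to any new data.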
Output:
1[1] "Train data before feature scaling" 2 mpg cyl disp hp drat wt qsec vs am gear carb 3Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 4Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 5Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 6Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 7Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 8Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4 9[1] "Train data after feature scaling" 10 mpg cyl disp hp drat wt 11Mazda RX4 0.2252332 6 -0.6421320 -0.6395978 0.6264782 -0.6470675 12Mazda RX4 Wag 0.2252332 6 -0.6421320 -0.6395978 0.6264782 -0.3787348 13Datsun 710 0.5380973 4 -1.0652699 -0.8746931 0.5298492 -0.9627529 14Hornet Sportabout -0.1745376 8 0.9853212 0.2592964 -0.8229572 0.2158061 15Valiant -0.2788257 6 -0.1132097 -0.7087435 -1.5766637 0.2368518 16Duster 360 -0.9393166 8 0.9853212 1.2273362 -0.7070024 0.3526031 17 qsec vs am gear carb 18Mazda RX4 -0.6309623 0 1 4 4 19Mazda RX4 Wag -0.3456281 0 1 4 4 20Datsun 710 0.4645173 1 1 4 1 21Hornet Sportabout -0.3456281 0 0 3 2 22Valiant 1.2848532 1 0 3 1 23Duster 360 -0.9468681 0 0 3 4
Splitting your dataset and scaling features are crucial steps in building effective machine learning models. By splitting the data, you ensure that your model is trained and tested on different data, which helps in evaluating its real-world performance. Feature scaling brings all features to a similar scale, which is especially important for algorithms that rely on distances (like K-Nearest Neighbors) or gradients (like gradient descent).
Mastering these techniques will significantly improve the accuracy and reliability of your models. These steps may seem straightforward, but they form the backbone of any robust machine learning project.
Are you ready to take the next step? Let's get started with the practice section and put these concepts into action.