Welcome back! In the last lesson, you learned how to preprocess your data to get it ready for machine learning. We covered converting categorical variables to factors and standardizing numerical features. Now, it's time to move on to the next crucial step: splitting your dataset.
In this lesson, you'll learn how to split your dataset into training and testing sets using the caret package in R.
By the end of this lesson, you'll be able to partition a dataset effectively for better model training and evaluation.
Splitting your dataset is a critical step in building reliable machine learning models. By dividing the data, you can train your model on one subset and test its performance on another, held-out subset. This tells you how well your model is likely to generalize to new, unseen data and keeps its performance metrics unbiased and reflective of its true predictive power.
Let's dive into a practical example using the iris dataset. We'll use the caret package to split it.
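If you'd like to follow along, load the caret package first (installing it with install.packages("caret") if you haven't already) and take a quick look at the iris dataset, which ships with base R:

# Load the caret package; install it first with install.packages("caret") if needed
library(caret)

# iris is built into R: 150 rows, four numeric measurements, and a Species factor
str(iris)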
Reproducibility is paramount in data science. Setting a random seed ensures that you get the same split every time you run the code, which is especially important when presenting or sharing your results. Here's how you set a seed in R:
set.seed(123)  # For reproducibility
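To see what the seed buys you, try running the same random draw twice: with the same seed, R produces identical results each time. Here's a quick illustration using base R's sample function (just for demonstration; it isn't part of the split itself):

# Draw 5 random row numbers twice, resetting the seed before each draw
set.seed(123)
first_draw <- sample(1:150, 5)

set.seed(123)
second_draw <- sample(1:150, 5)

# The two draws are identical because the seed was reset to the same value
identical(first_draw, second_draw)  # TRUE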
We use createDataPartition from the caret package to split the iris dataset into training and testing sets. Here's what each part of the code does:
# Create data partition indices for the training set (70% of the data)
trainIndex <- createDataPartition(iris$Species, p = 0.7, list = FALSE, times = 1)
- iris$Species: the column you are partitioning by. This ensures that both sets have a similar distribution of species.
- p = 0.7: 70% of the data will go to the training set.
- list = FALSE, times = 1: control parameters that simplify the output and perform the partitioning only once (the returned object is inspected briefly below).
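If you're curious what createDataPartition actually returns, you can inspect trainIndex directly. With list = FALSE, the result should be a single-column matrix of row numbers rather than a list; this check is optional and just for illustration:

# Peek at the partition object: row indices selected for the training set
head(trainIndex)
dim(trainIndex)  # should show 105 rows and 1 column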
Next, we use the indices created by createDataPartition to subset our iris dataset into training and testing sets:
# Create data partition indices for the training set (70% of the data)
trainIndex <- createDataPartition(iris$Species, p = 0.7, list = FALSE, times = 1)

# Subset the iris dataset into training and testing sets using indexing
irisTrain <- iris[trainIndex, ]
irisTest <- iris[-trainIndex, ]

# Print dimensions of the training and testing sets
print(dim(irisTrain))
print(dim(irisTest))

# Output:
# [1] 105 5
# [1] 45 5
- irisTrain <- iris[trainIndex, ]: selects the rows indexed by trainIndex to form the training set.
- irisTest <- iris[-trainIndex, ]: uses negative indexing to select all rows not in trainIndex for the testing set.

By printing the dimensions of irisTrain and irisTest, we can confirm that irisTrain contains 70% of the data (105 rows) and irisTest contains the remaining 30% (45 rows), ensuring our data split is correct and as expected.
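Because the partition was created on iris$Species, each species should keep roughly the same 70/30 proportion across the two subsets. A quick way to check this is with table:

# Count how many flowers of each species landed in each set
table(irisTrain$Species)   # roughly 35 per species
table(irisTest$Species)    # roughly 15 per species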
By now, you should have a clear understanding of how to split a dataset into training and testing sets using the caret package. This split is essential for evaluating the performance of your machine learning model in a fair and unbiased manner.
Excited to dive in? Let's proceed to the practice section and start splitting datasets together.