Lesson 2
Splitting the Dataset

Welcome back! In the last lesson, you learned how to preprocess your data to get it ready for machine learning. We covered converting categorical variables to factors and standardizing numerical features. Now, it's time to move on to the next crucial step: splitting your dataset.

What You'll Learn

In this lesson, you'll learn the steps to split your dataset into training and testing sets using the caret package in R. Specifically, you'll:

  • Understand the importance of training and testing data.
  • Split your dataset for reliable model evaluation.
  • Ensure reproducibility in your splits.

By the end of this lesson, you'll be able to partition a dataset effectively for better model training and evaluation.

Why It Matters

Splitting your dataset is a critical step in building reliable machine learning models. By dividing the data, you can train your model on one subset and test its performance on another unseen subset. This helps in evaluating how well your model will generalize to new, unseen data. Splitting the dataset ensures that your model's performance metrics are unbiased and reflective of its true predictive power.

Splitting the Dataset with caret

Let's dive into a practical example using the iris dataset. We'll use the caret package to split this dataset.

Step 1: Set a Seed for Reproducibility

Reproducibility is paramount in data science. Setting a random seed ensures that you get the same split every time you run the code, which is crucial when presenting or sharing your results. Here's how you set a seed in R:

R
set.seed(123) # For reproducibility
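To see the effect of the seed for yourself, draw some random numbers twice with the same seed. The draws are identical, which is exactly the property that makes your data split repeatable (the specific numbers drawn depend on your R version's random number generator):

```r
set.seed(123)
first_draw <- sample(1:150, 5)   # five random row numbers

set.seed(123)
second_draw <- sample(1:150, 5)  # same seed, so the same five numbers

identical(first_draw, second_draw) # TRUE
```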
Step 2: Splitting Data into Train and Test Sets

We use createDataPartition from the caret package to split the iris dataset into training and testing sets. Here’s the breakdown of what each part of the code does:

R
library(caret) # Load caret for createDataPartition

# Create data partition indices for training set (70% of the data)
trainIndex <- createDataPartition(iris$Species, p = 0.7, list = FALSE, times = 1)
  • iris$Species: The column you are partitioning by. This ensures that both sets have a similar distribution of species.
  • p = 0.7: 70% of the data will go to the training set.
  • list = FALSE: Return the indices as a matrix rather than a list, which makes row indexing straightforward.
  • times = 1: Perform the partitioning only once.
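Before subsetting, it can help to inspect what createDataPartition actually returns. With list = FALSE it yields a one-column matrix of row numbers (the column is typically named Resample1), and because the split is stratified by species, 70% of each class is selected. A quick sketch:

```r
library(caret)

set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = 0.7, list = FALSE, times = 1)

head(trainIndex)  # a one-column matrix of row numbers
nrow(trainIndex)  # 105: 70% of each of the three 50-row species groups
```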

Next, we use the indices created by createDataPartition to subset our iris dataset into training and testing sets:

R
# Create data partition indices for training set (70% of the data)
trainIndex <- createDataPartition(iris$Species, p = 0.7, list = FALSE, times = 1)

# Subset iris dataset into training and testing sets using indexing
irisTrain <- iris[trainIndex, ]
irisTest <- iris[-trainIndex, ]

# Print dimensions of training and testing sets
print(dim(irisTrain))
print(dim(irisTest))

# Output:
# [1] 105   5
# [1]  45   5
  • irisTrain <- iris[trainIndex,]: Selects rows that are indexed by trainIndex to form the training set.
  • irisTest <- iris[-trainIndex,]: Uses negative indexing to select all rows not in trainIndex for the testing set.

By printing the dimensions of irisTrain and irisTest, we can confirm that irisTrain contains 70% of the data (105 rows) and irisTest the remaining 30% (45 rows), verifying that the split worked as intended.
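Because we partitioned on iris$Species, the class balance should carry over to both subsets. You can verify the stratification with table():

```r
# Check that class proportions are preserved in both subsets
table(irisTrain$Species) # 35 rows per species (70% of 50)
table(irisTest$Species)  # 15 rows per species (30% of 50)
```

If you had instead split with a plain random sample(), a class could end up over- or under-represented in either subset, which is exactly what stratified partitioning avoids.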

Conclusion

By now, you should have a clear understanding of how to split a dataset into training and testing sets using the caret package. This split is essential for evaluating the performance of your machine learning model in a fair and unbiased manner.

Excited to dive in? Let's proceed to the practice section and start splitting datasets together.

Enjoy this lesson? Now it's time to practice with Cosmo!
Practice is how you turn knowledge into actual skills.