Imagine you built a robot to recognize apples and oranges. But how do you know if it's good at this task? You need to test it on some new apples and oranges it hasn’t seen before. In machine learning, we do something similar by splitting our data into training and test sets. This helps us see how well our model performs on new data.
It also helps detect and prevent overfitting. Overfitting is when a machine learning model learns the training data too well, including noise and details that don't apply to new data. This results in excellent performance on the training set but poor performance on the test set, indicating that the model has memorized specifics rather than learned general patterns.
Today, we will learn how to split a dataset into training and test sets using the `train_test_split` function from scikit-learn. By the end of this lesson, you'll know how to prepare your data properly to evaluate your model.
A train-test split divides the dataset into two parts: one to train the model and one to test it. The training set helps the model learn patterns, and the test set helps us check how well the model predicts new data.
For example, if you have 10 pictures of fruits, you might use 8 to train your robot and 2 to test it. This ensures the robot hasn’t memorized the training pictures but can recognize new ones too.
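To make the idea concrete, here is a minimal, library-free sketch of that 8/2 split using plain list slicing. The `fruit_0` … `fruit_9` names are hypothetical picture labels invented for illustration, not real data:

```python
# A toy 8/2 split with plain list slicing -- no libraries needed.
fruits = [f"fruit_{i}" for i in range(10)]  # 10 hypothetical fruit pictures

train = fruits[:8]  # first 8 pictures teach the robot
test = fruits[8:]   # last 2 pictures check what it learned

print(len(train), len(test))  # 8 2
```

Note that this naive slice always takes the *last* two items as the test set; real splitting tools shuffle the data first, as we'll see next.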
To split the data, we use the `train_test_split` function from the scikit-learn library. This function makes it easy to divide your data randomly. Let's first see how to import what we need:
```python
from sklearn.model_selection import train_test_split
```
Let's use a very small dataset. Imagine we have 10 fruit images (features) and their labels (like apple or orange). Here is our dataset:
```python
# Small dataset
X = [[0.1], [0.2], [0.1], [0.5], [0.5], [0.2], [0.2], [0.4], [0.1], [0.2]]  # 10 features
y = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0]  # 10 target labels
```
In this example, `X` holds your features (like fruit images), and `y` holds your target labels (like 0 for 'apple' or 1 for 'orange').
Now, let's use the `train_test_split` function to divide our dataset into training and test sets:
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train))  # 8
print(len(X_test))   # 2
print(len(y_train))  # 8
print(len(y_test))   # 2
```
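One detail worth checking for yourself: `train_test_split` shuffles the data before splitting, so the test set is a random pair of samples rather than simply the last two rows. A quick sketch to verify that the two sets together account for every sample exactly once:

```python
from sklearn.model_selection import train_test_split

X = [[0.1], [0.2], [0.1], [0.5], [0.5], [0.2], [0.2], [0.4], [0.1], [0.2]]
y = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The two sets together cover all 10 samples, with no overlap in between.
print(len(X_train) + len(X_test))  # 10
print(len(y_train) + len(y_test))  # 10

# Because of shuffling, the test rows are a random pair, not the last two of X.
print(X_test)
```

Every sample ends up in exactly one of the two sets, which is what keeps the test set truly "unseen" during training.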
Here’s what this does:
- `X_train` and `y_train` are the training sets.
- `X_test` and `y_test` are the test sets.
- `test_size=0.2` means 20% of the data is for testing and 80% is for training. It is common to use 20-30% of your data for the test set.
- `random_state=42` ensures the split is the same every time you run the code, which is handy for reproducibility. You can use any integer for `random_state`; `42` is just a random choice (or a reference to some book 😉).
In this lesson, we learned why it's important to split our data into training and test sets. We discussed overfitting, which is like memorizing homework answers but failing the test. We then explored the `train_test_split` function, used a small dataset, and split it into training and test sets. Finally, we checked the sizes of our splits to ensure everything was set up correctly.
Great job! Now, it’s time to practice what you’ve learned. You will get hands-on experience applying the train-test split to different datasets, ensuring you’re ready to evaluate your models correctly. Remember, practice is key to mastering machine learning concepts!