Imagine you built a robot to recognize apples and oranges. But how do you know if it's good at this task? You need to test it on some new apples and oranges it hasn’t seen before. In machine learning, we do something similar by splitting our data into training and test sets. This helps us see how well our model performs on new data.
It also helps detect and prevent overfitting. Overfitting is when a machine learning model learns the training data too well, including noise and details that don't apply to new data. This results in excellent performance on the training set but poor performance on the test set, indicating that the model has memorized specifics rather than learned general patterns.
Today, we will learn how to split a dataset into training and test sets using the `train_test_split` function from scikit-learn. By the end of this lesson, you'll know how to prepare your data properly to evaluate your model.
A train-test split divides the dataset into two parts: one to train the model and one to test it. The training set helps the model learn patterns, and the test set helps us check how well the model predicts new data.
For example, if you have 10 pictures of fruits, you might use 8 to train your robot and 2 to test it. This ensures the robot hasn’t memorized the training pictures but can recognize new ones too.
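To make the idea concrete, here is a minimal, library-free sketch of that 8/2 split using plain list slicing. The `fruit_0` … `fruit_9` names are hypothetical picture labels invented for illustration, not real data:

```python
# A toy 8/2 split with plain list slicing -- no libraries needed.
fruits = [f"fruit_{i}" for i in range(10)]  # 10 hypothetical fruit pictures

train = fruits[:8]  # first 8 pictures teach the robot
test = fruits[8:]   # last 2 pictures check what it learned

print(len(train), len(test))  # 8 2
```

Note that this naive slice always takes the *last* two items as the test set; real splitting tools shuffle the data first, as we'll see next.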
To split the data, we use the `train_test_split` function from the scikit-learn library. This function makes it easy to divide your data randomly. Let's first see how to import what we need:
```python
from sklearn.model_selection import train_test_split
```
Let's use a very small dataset. Imagine we have 10 fruit images (features) and their labels (like apple or orange). Here is our dataset:
```python
# Small dataset
X = [[0.1], [0.2], [0.1], [0.5], [0.5], [0.2], [0.2], [0.4], [0.1], [0.2]]  # 10 features
y = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0]  # 10 target labels
```
In this example, `X` holds your features (like fruit images), and `y` holds your target labels (like 0 for 'apple' or 1 for 'orange').
Now, let's use the `train_test_split` function to divide our dataset into training and test sets:
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train))  # 8
print(len(X_test))   # 2
print(len(y_train))  # 8
print(len(y_test))   # 2
```
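One detail worth checking for yourself: `train_test_split` shuffles the data before splitting, so the test set is a random pair of samples rather than simply the last two rows. A quick sketch to verify that the two sets together account for every sample exactly once:

```python
from sklearn.model_selection import train_test_split

X = [[0.1], [0.2], [0.1], [0.5], [0.5], [0.2], [0.2], [0.4], [0.1], [0.2]]
y = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The two sets together cover all 10 samples, with no overlap in between.
print(len(X_train) + len(X_test))  # 10
print(len(y_train) + len(y_test))  # 10

# Because of shuffling, the test rows are a random pair, not the last two of X.
print(X_test)
```

Every sample ends up in exactly one of the two sets, which is what keeps the test set truly "unseen" during training.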
Here’s what this does:
- `X_train` and `y_train` are the training sets.
- `X_test` and `y_test` are the test sets.
- `test_size=0.2` means 20% of the data is for testing and 80% is for training. It is common to use 20-30% of your data for the test set.
- `random_state=42` ensures the split is the same every time you run the code, which is handy for reproducibility. You can use any integer for `random_state`; `42` is just a random choice (or a reference to some book 😉).
In this lesson, we learned why it's important to split our data into training and test sets. We discussed overfitting, which is like memorizing homework answers but failing the test. We then explored the `train_test_split` function, used a small dataset, and split it into training and test sets. Finally, we checked the sizes of our splits to ensure everything was set up correctly.
Great job! Now, it’s time to practice what you’ve learned. You will get hands-on experience applying the train-test split to different datasets, ensuring you’re ready to evaluate your models correctly. Remember, practice is key to mastering machine learning concepts!