Lesson 1
Splitting Data into Train and Test Sets
Topic Overview

Hello and welcome! In today's lesson, we will explore the critical step of splitting data into training and testing sets, which is foundational for building any robust regression model. By the end of this lesson, you'll be capable of taking a dataset and accurately dividing it into training and testing sets.

Understanding the Importance of Training and Testing Sets

When developing a machine learning model, it's essential to test its performance on unseen data. This is because while a model may perform well on the training set, it is the performance on the testing set that determines how well the model generalizes to new, unseen data. Without this split, we risk overestimating the model's capabilities due to its exposure to the training data alone.

By splitting the dataset into training and testing sets, we allow the model to learn on one subset of the data (training set) and evaluate its performance on another subset (testing set). This ensures that the model generalizes well to new data, making it more robust and reliable.

One-Hot Encoding vs. Categorical Encoding

Before we can split the diamonds dataset into training and testing sets, we need to preprocess it by converting its categorical variables into numerical values.

One-hot encoding is a method where each category value is converted into a new binary column. Each column represents a category, and the values are 0 or 1, indicating the absence or presence of the category. This is particularly useful for machine learning algorithms that require numerical input and can benefit from each category being represented as a distinct feature. In Pandas, we use the pd.get_dummies function to achieve one-hot encoding.

Here's how to implement one-hot encoding for our diamonds dataset:

Python
import seaborn as sns
import pandas as pd

# Load the diamonds dataset
diamonds = sns.load_dataset('diamonds')

# Convert categorical variables to dummy/indicator variables
diamonds = pd.get_dummies(diamonds, drop_first=True)

# Display first few rows of the transformed dataset
print(diamonds.head())

Output:

Plain text
   carat  depth  table  price     x     y     z  cut_Premium  cut_Very Good  cut_Good  cut_Fair  color_E  ...
0   0.23   61.5   55.0    326  3.95  3.98  2.43        False          False     False     False     True  ...
1   0.21   59.8   61.0    326  3.89  3.84  2.31         True          False     False     False     True  ...
2   0.23   56.9   65.0    327  4.05  4.07  2.31        False          False      True     False     True  ...
3   0.29   62.4   58.0    334  4.20  4.23  2.63         True          False     False     False    False  ...
4   0.31   63.3   58.0    335  4.34  4.35  2.75        False          False      True     False    False  ...

The drop_first=True parameter in the pd.get_dummies function drops the first category of each categorical variable to avoid multicollinearity. The dropped category is implied whenever all the remaining dummy columns are 0, so keeping it would add a redundant column that is perfectly collinear with the others. Removing it lets the remaining categories still uniquely represent the data, which matters especially for linear models, where perfect collinearity prevents the coefficients from being estimated uniquely.
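
To see concretely what drop_first changes, here is a small sketch on a toy Series with hypothetical data (not part of the diamonds dataset):

Python
import pandas as pd

# Toy example: three category labels (hypothetical data)
s = pd.Series(['Fair', 'Good', 'Ideal'])

# Without drop_first: one dummy column per category (Fair, Good, Ideal)
print(pd.get_dummies(s))

# With drop_first=True: the 'Fair' column is dropped; a row where the
# remaining dummies are all False implicitly means 'Fair'
print(pd.get_dummies(s, drop_first=True))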

Note that one-hot encoding is similar to, but distinct from, the previously mentioned method of categorical encoding, which converts each category to a unique integer code. Categorical encoding is less memory-intensive and can be useful for models that can handle categorical features natively, such as tree-based models. However, for linear models and certain other algorithms, one-hot encoding is often preferred because it avoids the spurious ordinal relationship that integer codes imply.
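
For contrast, here is a minimal sketch of categorical (integer) encoding using pandas category codes; this is one common way to do it, applied to a fresh copy of the dataset:

Python
import seaborn as sns

# Reload a fresh copy so the original categorical columns are still intact
diamonds_raw = sns.load_dataset('diamonds')

# Map each category of 'cut' to a unique integer code
# (astype('category') ensures the column is a pandas Categorical first)
diamonds_raw['cut_code'] = diamonds_raw['cut'].astype('category').cat.codes
print(diamonds_raw[['cut', 'cut_code']].head())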

Obtaining Features and Labels

Now, it’s time to separate the dataset into features and labels. Once they are separated, we will use train_test_split from scikit-learn to create a training set (used to train the model) and a testing set (used to evaluate the model).

First, let's define the features (X) and the target variable (y):

Python
from sklearn.model_selection import train_test_split

# Selecting features and target variable
# Drop the 'price' column from X to ensure that X only contains input features and not the target variable.
X = diamonds.drop('price', axis=1)
y = diamonds['price']

drop is used to remove the target variable column from the feature dataset to ensure that X only contains input features. This prevents the target variable from inadvertently being used as a feature, which would inflate the model's performance metrics and lead to incorrect conclusions. axis=1 is used to indicate that a column will be dropped, rather than a row.

Note that drop does not modify the original diamonds DataFrame; it returns a new one. That is why we can still extract the prices afterwards with diamonds['price'].
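
As a quick sanity check (a sketch, using the variables defined above), you can confirm that the target column was removed from X but is still present in diamonds:

Python
# 'price' remains in the original DataFrame because drop returned a new one
print('price' in diamonds.columns)  # True

# ...but it is absent from the feature matrix
print('price' in X.columns)  # False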

Splitting the Dataset into Train and Test Sets

Now that we have separated the features and the target variable, we can split the dataset into training and testing sets. The train_test_split function takes several parameters, including test_size, which defines the proportion of the dataset to be used as the testing set, and random_state, which ensures reproducibility.

Python
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Verifying the Split and Understanding Shapes

It's crucial to check the shapes of the resulting datasets to confirm that the data has been split correctly and the proportions are as expected.

Python
# Print the shapes of the training and testing sets
print("Data split successfully!")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

Output:

Plain text
Data split successfully!
X_train shape: (43152, 23)
X_test shape: (10788, 23)
y_train shape: (43152,)
y_test shape: (10788,)

The shapes confirm that the split is correct: 80% of the rows (43152) are in the training set and 20% (10788) are in the testing set. As an extra check, make sure the combined lengths of X_train and X_test equal the original dataset length.

Python
# Verify that the sum of training and testing samples equals the original dataset
assert len(X_train) + len(X_test) == len(diamonds)
assert len(y_train) + len(y_test) == len(diamonds)

If the assertions pass without errors, then the dataset is split correctly.
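
You can also verify that the test fraction matches the test_size you requested; here is a small sketch, assuming the variables from the split above:

Python
# The test set should hold roughly 20% of the rows (test_size=0.2)
test_fraction = len(X_test) / len(diamonds)
print(f"Test fraction: {test_fraction:.2f}")  # 0.20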

Lesson Summary and Practice

Congratulations! You've successfully learned how to split a dataset into training and testing sets. This lesson was critical because it ensures that your machine learning models are trained and evaluated on separate data, providing a more realistic measure of model performance. Keep practicing, and soon, splitting datasets will become second nature to you!

Enjoy this lesson? Now it's time to practice with Cosmo!
Practice is how you turn knowledge into actual skills.