Welcome! In today's lesson, we will learn how to split a dataset into training and testing sets. This is a crucial step in preparing your data for machine learning models to ensure they generalize well to unseen data.
Lesson Goal: By the end of this lesson, you will understand how to split financial datasets, such as Tesla's stock data, into training and testing sets using Python.
Before we delve into splitting the dataset, let's briefly review the preprocessing steps we have covered so far. The dataset has been loaded, new features have been engineered, and the features have been scaled.
Here's the code for those steps for a quick revision:
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
import datasets

# Loading and preprocessing the dataset (revision)
data = datasets.load_dataset('codesignal/tsla-historic-prices')
tesla_df = pd.DataFrame(data['train'])
tesla_df['High-Low'] = tesla_df['High'] - tesla_df['Low']
tesla_df['Price-Open'] = tesla_df['Close'] - tesla_df['Open']

# Defining features and target
features = tesla_df[['High-Low', 'Price-Open', 'Volume']].values
# Target is the column that we are trying to predict
target = tesla_df['Close'].values

# Scaling
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)
```
Why Split?

To avoid overfitting, where a model learns the training data too well and performs poorly on new, unseen data, it's important to evaluate your machine learning model on data it has never seen before. This is where splitting datasets into training and testing sets comes into play. It ensures that your model's performance is not just tailored to the training data but can be generalized to new inputs.
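To make the idea concrete, here is a minimal sketch of holding back 25% of the rows for evaluation. It uses a small toy NumPy array rather than the Tesla data, purely for illustration:

```python
import numpy as np

# Toy dataset: 10 samples with 2 features each (an assumption for illustration)
data = np.arange(20).reshape(10, 2)

# Keep the first 75% of rows for training, the rest for testing
split_point = int(len(data) * 0.75)
train, test = data[:split_point], data[split_point:]

print(train.shape)  # (7, 2)
print(test.shape)   # (3, 2)
```

In practice we won't slice arrays by hand like this; scikit-learn provides a dedicated helper, which we cover next.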
The `train_test_split` function from `sklearn.model_selection` helps us easily split the data.

Parameters of `train_test_split`:

- `test_size`: The proportion of the dataset to include in the test split (e.g., `0.25` means 25% of the data will be used for testing).
- `train_size`: The proportion of the dataset to include in the train split (optional if `test_size` is provided).
- `random_state`: Controls the shuffling applied to the data before the split. Providing a fixed value ensures reproducibility.

Let's split our scaled features and targets into training and testing sets:
```python
from sklearn.model_selection import train_test_split

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(
    features_scaled, target, test_size=0.25, random_state=42
)
```
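One caveat worth noting: `train_test_split` shuffles the rows by default. For time-ordered data such as daily stock prices, a chronological split is sometimes preferred; passing `shuffle=False` keeps the final portion of rows as the test set. A small self-contained sketch, using toy stand-in arrays rather than the scaled Tesla features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for features_scaled and target (assumed for illustration)
features_demo = np.arange(40, dtype=float).reshape(20, 2)
target_demo = np.arange(20, dtype=float)

# shuffle=False preserves the original (chronological) order:
# the first 75% of rows train, the final 25% test
X_tr, X_te, y_tr, y_te = train_test_split(
    features_demo, target_demo, test_size=0.25, shuffle=False
)
print(y_te)  # the last 5 targets: [15. 16. 17. 18. 19.]
```

For this lesson we stick with the default shuffled split, which is the standard starting point.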
The `train_test_split` function will split our dataset into training and testing sets:

- `features_scaled` and `target` are the inputs.
- `test_size=0.25` means 25% of the data goes to the test set.
- `random_state=42` ensures reproducibility. The state can be any other number, too.

After splitting the dataset, it's important to verify the shapes and the contents of the resulting sets to ensure the split was done correctly.
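As a quick aside, the reproducibility promised by `random_state` is easy to check: calling the function twice with the same seed returns identical splits. A minimal sketch on a toy array:

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)

# Same random_state -> the exact same rows land in each split every time
_, test_a = train_test_split(data, test_size=0.3, random_state=42)
_, test_b = train_test_split(data, test_size=0.3, random_state=42)

print(np.array_equal(test_a, test_b))  # True
```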
- Checking Shapes
- Inspecting Sample Rows

Let's check our split data:
```python
# Verify splits
print(f"Training features shape: {X_train.shape}")
print(f"Testing features shape: {X_test.shape}")

print(f"First 5 rows of training features: \n{X_train[:5]}")
print(f"First 5 training targets: {y_train[:5]}\n")

print(f"First 5 rows of testing features: \n{X_test[:5]}")
print(f"First 5 testing targets: {y_test[:5]}")
```
The output of the above code will be:
```
Training features shape: (2510, 3)
Testing features shape: (837, 3)
First 5 rows of training features: 
[[-4.66075964e-01  6.80184955e-02  3.11378946e-01]
 [ 4.01701510e+00  5.04529577e+00 -4.61555718e-02]
 [ 2.04723437e+00  3.09900603e+00  9.43022378e-04]
 [-5.30579018e-01 -2.30986178e-02 -5.67163058e-01]
 [-4.78854883e-01 -5.79376618e-02 -6.94451021e-01]]
First 5 training targets: [ 17.288    355.666656 222.419998  15.000667  13.092   ]

First 5 rows of testing features: 
[[-0.36226203  0.2087143   0.69346624]
 [ 1.27319589  1.04049732  0.58204785]
 [-0.53556882 -0.03231093 -0.86874821]
 [-0.49029475  0.07773304 -0.51784526]
 [ 3.0026057  -4.41816938 -0.31923731]]
First 5 testing targets: [ 23.209333 189.606674  14.730667  16.763332 325.733337]
```
This output confirms that our dataset has been successfully split into training and testing sets, showing the shape of each set and giving us a glimpse into the rows of our features and targets post-split. It's an important validation step to ensure our data is ready for machine learning model training and evaluation.
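Although model training is covered later, here is a hedged sketch of how the four arrays are typically used next: fit on the training set only, then score on the held-out test set. It uses synthetic data and `LinearRegression` purely for illustration, not as part of this lesson's pipeline:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the example runs on its own (an assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# The model never sees the test rows during fitting, so this score
# estimates performance on genuinely unseen data
model = LinearRegression().fit(X_train, y_train)
print(f"R^2 on unseen test data: {model.score(X_test, y_test):.3f}")
```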
Great job! In this lesson, we:

- Reviewed the preprocessing steps: loading the Tesla dataset, engineering features, and scaling them.
- Used `train_test_split` to divide the dataset into training and testing sets.
- Verified the split by checking shapes and inspecting sample rows.

These steps are crucial for ensuring that your machine learning models can generalize well to new data. Up next, you'll have some practice exercises to solidify your understanding and improve your data preparation skills. Keep going!