In today's lesson, our focus is on preprocessing the Iris dataset for TensorFlow. We will explore various techniques, such as data splitting, feature scaling, and one-hot encoding. This foundation is invaluable in the field of machine learning as it aids in understanding the intricacies of data transformation before we feed it to a neural network. Let's get into it!
Before we delve into data preprocessing, it is important to understand the data we are working with. The Iris dataset comprises measurements of 150 Iris flowers from three different species. Each sample includes the following 4 features:
- Sepal length (cm): e.g., 5.1, 4.9, 4.7, etc.
- Sepal width (cm): e.g., 3.5, 3.0, 3.2, etc.
- Petal length (cm): e.g., 1.4, 1.4, 1.3, etc.
- Petal width (cm): e.g., 0.2, 0.2, 0.2, etc.
Additionally, each sample has a class label representing the Iris species. The targets in the dataset are encoded as one of the following integers:
- Iris setosa: 0
- Iris versicolor: 1
- Iris virginica: 2
With these measurements and labels, the Iris dataset is a classic multivariate dataset often used to introduce machine learning.
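To see these features and labels for yourself, here is a quick peek (a small sketch using the `load_iris` loader we will introduce formally in the next step):

```python
from sklearn.datasets import load_iris

# Peek at the feature names and species names bundled with the dataset
iris = load_iris()
print(iris.feature_names)  # ['sepal length (cm)', 'sepal width (cm)', ...]
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
```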
Data preprocessing is a crucial step in machine learning. It is the process of converting data from its initial form into another format to prepare it for the next processing phase. In this converted form, algorithms can extract information more easily, which improves their ability to predict. The preprocessing steps we will cover in today's lesson are loading, splitting, scaling, and encoding the data.
Before diving into preprocessing, let's start by loading the Iris dataset. We use the `load_iris` function from `scikit-learn` for this purpose. It returns the feature matrix `X` and the target vector `y`.
```python
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Displaying shapes
print(f'X shape: {X.shape}')
print(f'y shape: {y.shape}')
```
The output will be:
```text
X shape: (150, 4)
y shape: (150,)
```
Here, `X` contains 150 samples, each with 4 features (sepal length, sepal width, petal length, and petal width). The `y` vector contains 150 class labels, with each label representing one of the three Iris species. This initial step helps us understand the dimensions of our dataset before we proceed with further processing.
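As a quick sanity check, we can also inspect a single raw sample alongside its integer label and the species name it maps to:

```python
# Inspect one raw sample, its label, and the corresponding species
print(X[0])                     # [5.1 3.5 1.4 0.2]
print(y[0])                     # 0
print(iris.target_names[y[0]])  # setosa
```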
The first step in preprocessing itself is data splitting. We divide the dataset into two parts: a training set and a testing set. The training set is used to train the model, while the testing set validates its performance. Typically, we use `scikit-learn`'s `train_test_split` function for this purpose. By splitting the data, we ensure that our model can generalize well to new, unseen data. The `stratify` parameter ensures that the proportion of the different classes in the split datasets matches that of the original dataset. For our specific example, we will use 70% of the data for training and 30% for testing.
```python
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```
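Since `stratify` is doing important work here, it is worth verifying that both splits preserve the original class balance. A short check, continuing from the code above:

```python
import numpy as np

# Each of the three classes should make up roughly one third of each split
print(np.bincount(y_train) / len(y_train))  # [0.333... 0.333... 0.333...]
print(np.bincount(y_test) / len(y_test))    # [0.333... 0.333... 0.333...]
```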
After splitting the data, we perform feature scaling to normalize the range of the input features. This step is crucial because it puts all features on the same scale, preventing features with larger ranges from dominating those with smaller ones. We achieve this normalization using the `StandardScaler` from `scikit-learn`, which standardizes each feature by centering it to have a mean of 0 and scaling it to unit variance. The `fit` method calculates the mean and standard deviation used for scaling from the training data only, so no information from the test set leaks into the transformation.
```python
from sklearn.preprocessing import StandardScaler

# Scale the features
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
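To confirm the scaler behaves as described, we can check the per-feature statistics of the transformed data. The training features should have a mean of approximately 0 and a standard deviation of approximately 1; the test features will only be close to these values, because they are scaled with statistics computed from the training set:

```python
# Training features: mean ~0 and standard deviation ~1 per column
print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))

# Test features are scaled with the *training* statistics,
# so they only come close to 0 and 1
print(X_test_scaled.mean(axis=0).round(3))
print(X_test_scaled.std(axis=0).round(3))
```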
The final preprocessing step is data encoding. The target variables in the Iris dataset are categorical and must be converted into a format that our machine learning model can utilize. This is done using one-hot encoding, which transforms each categorical label into a binary (0 or 1) vector with a single 1 marking the class. For example, the label `1` (Iris versicolor) is represented as `[0, 1, 0]` after one-hot encoding. We use the `OneHotEncoder` from `scikit-learn` to perform this step, ensuring our target variables are ready for input into the model. The `fit` method learns the unique categories present in the training data, which are then used for encoding.
```python
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the targets
encoder = OneHotEncoder(sparse_output=False)
encoder.fit(y_train.reshape(-1, 1))
y_train_encoded = encoder.transform(y_train.reshape(-1, 1))
y_test_encoded = encoder.transform(y_test.reshape(-1, 1))
```
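We can also inspect what the encoder learned and confirm that the encoding is reversible, which comes in handy later when converting model predictions back to class labels:

```python
import numpy as np

# The encoder learned the three integer categories from the training labels
print(encoder.categories_)  # [array([0, 1, 2])]

# Spot-check one label, then recover all integer labels from the one-hot vectors
print(y_train[0], '->', y_train_encoded[0])
print(np.array_equal(np.argmax(y_train_encoded, axis=1), y_train))  # True
```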
Below, the preprocessing steps covered so far (loading, splitting, scaling, and encoding) are encapsulated in a single function. This modular design allows us to import the preprocessed data in another file where we develop our model.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder

def load_preprocessed_data():
    # Load the Iris dataset
    iris = load_iris()
    X, y = iris.data, iris.target

    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42
    )

    # Scale the features
    scaler = StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # One-hot encode the targets
    encoder = OneHotEncoder(sparse_output=False).fit(y_train.reshape(-1, 1))
    y_train_encoded = encoder.transform(y_train.reshape(-1, 1))
    y_test_encoded = encoder.transform(y_test.reshape(-1, 1))

    return X_train_scaled, X_test_scaled, y_train_encoded, y_test_encoded
```
After defining the function that preprocesses the data, we can load the preprocessed data and print a sample of the training input and target.
```python
# Load preprocessed data
X_train, X_test, y_train, y_test = load_preprocessed_data()

# Print a sample of one training input and target
print(f'Sample of preprocessed X_train: {X_train[0]}')
print(f'Sample of preprocessed y_train: {y_train[0]}\n')

# Print the shape of scaled and encoded data
print(f'Shape of preprocessed X_train: {X_train.shape}')
print(f'Shape of preprocessed X_test: {X_test.shape}')
print(f'Shape of preprocessed y_train: {y_train.shape}')
print(f'Shape of preprocessed y_test: {y_test.shape}')
```
The output of the above code will be:
```text
Sample of preprocessed X_train: [-0.90045861 -1.22024754 -0.4419858 -0.13661044]
Sample of preprocessed y_train: [0. 1. 0.]

Shape of preprocessed X_train: (105, 4)
Shape of preprocessed X_test: (45, 4)
Shape of preprocessed y_train: (105, 3)
Shape of preprocessed y_test: (45, 3)
```
This output illustrates the results of our preprocessing steps: standardized feature data and one-hot encoded target variables, ready to be fed to a machine learning model.
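As a glimpse of where this is headed, here is a minimal sketch (not part of this lesson's required code) showing how these shapes line up with a TensorFlow model: a 4-dimensional input for the scaled features and a 3-unit softmax output for the one-hot encoded targets:

```python
import tensorflow as tf

# Minimal sketch: the preprocessed arrays plug straight into a Keras model.
# The input dimension (4) matches X_train.shape[1], and the output
# dimension (3) matches the one-hot encoded y_train.shape[1].
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(10, activation='relu'),  # 10 units is an arbitrary illustrative choice
    tf.keras.layers.Dense(3, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```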
In conclusion, we have successfully preprocessed the Iris dataset, making it ready for machine learning modeling with TensorFlow. We've loaded, split, scaled, and encoded the data using Python. This foundational knowledge is essential for you as a Machine Learning Engineer, helping you improve accuracy and build efficient models using TensorFlow.
Next, we will have exercises to consolidate these preprocessing steps. The exercises aim to enhance your understanding and application of data preprocessing and prepare you for more challenging tasks in the future. Happy learning!