In today's lesson, our focus is on preprocessing the Iris dataset for TensorFlow. We will explore various techniques, such as data splitting, feature scaling, and one-hot encoding. This foundation is invaluable in the field of machine learning as it aids in understanding the intricacies of data transformation before we feed it to a neural network. Let's get into it!
Before we delve into data preprocessing, it is important to understand the data we are working with. The Iris dataset comprises measurements of 150 Iris flowers from three different species. Each sample includes the following 4 features:
- Sepal length (cm): e.g., 5.1, 4.9, 4.7, etc.
- Sepal width (cm): e.g., 3.5, 3.0, 3.2, etc.
- Petal length (cm): e.g., 1.4, 1.4, 1.3, etc.
- Petal width (cm): e.g., 0.2, 0.2, 0.2, etc.
Additionally, each sample has a class label representing the Iris species. The targets in the dataset are encoded as one of the following integers:
- Iris setosa: 0
- Iris versicolor: 1
- Iris virginica: 2
With these measurements and labels, the Iris dataset is a classic multivariate dataset often used to introduce machine learning.
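To see these features and labels for yourself, here is a quick peek (a small sketch using the `load_iris` loader we will introduce formally in the next step):

```python
from sklearn.datasets import load_iris

# Peek at the feature names and species names bundled with the dataset
iris = load_iris()
print(iris.feature_names)  # ['sepal length (cm)', 'sepal width (cm)', ...]
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
```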
Data preprocessing is a crucial step in machine learning. It is the process of converting data from its initial form into another format to prepare it for the next processing phase. In this converted form, algorithms can extract information more easily, which improves their ability to predict. The preprocessing steps we will cover in today's lesson are loading, splitting, scaling, and encoding the data.
Before diving into preprocessing, let's start by loading the Iris dataset. We use the `load_iris` function from `scikit-learn` for this purpose. It returns the feature matrix `X` and the target vector `y`.
```python
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Displaying shapes
print(f'X shape: {X.shape}')
print(f'y shape: {y.shape}')
```
The output will be:
```text
X shape: (150, 4)
y shape: (150,)
```
Here, `X` contains 150 samples, each with 4 features (sepal length, sepal width, petal length, and petal width). The `y` vector contains 150 class labels, with each label representing one of the three Iris species. This initial step helps us understand the dimensions of our dataset before we proceed with further processing.
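As a quick sanity check, we can also inspect a single raw sample alongside its integer label and the species name it maps to:

```python
# Inspect one raw sample, its label, and the corresponding species
print(X[0])                     # [5.1 3.5 1.4 0.2]
print(y[0])                     # 0
print(iris.target_names[y[0]])  # setosa
```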
The first step in preprocessing itself is data splitting. We divide the dataset into two parts: a training set and a testing set. The training set is used to train the model, while the testing set validates its performance. Typically, we use `scikit-learn`'s `train_test_split` function for this purpose. By splitting the data, we ensure that our model can generalize well to new, unseen data. The `stratify` parameter ensures that the proportion of the different classes in the split datasets matches that of the original dataset. For our specific example, we will use 70% of the data for training and 30% for testing.
```python
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```
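Since `stratify` is doing important work here, it is worth verifying that both splits preserve the original class balance. A short check, continuing from the code above:

```python
import numpy as np

# Each of the three classes should make up roughly one third of each split
print(np.bincount(y_train) / len(y_train))  # [0.333... 0.333... 0.333...]
print(np.bincount(y_test) / len(y_test))    # [0.333... 0.333... 0.333...]
```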
After splitting the data, we perform feature scaling to normalize the range of the input features. This step is crucial because it puts all features on the same scale, preventing features with larger ranges from dominating those with smaller ones. We achieve this normalization using the `StandardScaler` from `scikit-learn`, which standardizes each feature by centering it to have a mean of 0 and scaling it to unit variance. The `fit` method calculates the mean and standard deviation used for scaling from the training data only, so no information from the test set leaks into the transformation.
```python
from sklearn.preprocessing import StandardScaler

# Scale the features
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
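To confirm the scaler behaves as described, we can check the per-feature statistics of the transformed data. The training features should have a mean of approximately 0 and a standard deviation of approximately 1; the test features will only be close to these values, because they are scaled with statistics computed from the training set:

```python
# Training features: mean ~0 and standard deviation ~1 per column
print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))

# Test features are scaled with the *training* statistics,
# so they only come close to 0 and 1
print(X_test_scaled.mean(axis=0).round(3))
print(X_test_scaled.std(axis=0).round(3))
```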
The final preprocessing step is data encoding. The target variables in the Iris dataset are categorical and must be converted into a format that our machine learning model can utilize. This is done using one-hot encoding, which transforms each categorical label into a binary (0 or 1) vector with a single 1 marking the class. For example, the label `1` (Iris versicolor) is represented as `[0, 1, 0]` after one-hot encoding. We use the `OneHotEncoder` from `scikit-learn` to perform this step, ensuring our target variables are ready for input into the model. The `fit` method learns the unique categories present in the training data, which are then used for encoding.
```python
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the targets
encoder = OneHotEncoder(sparse_output=False)
encoder.fit(y_train.reshape(-1, 1))
y_train_encoded = encoder.transform(y_train.reshape(-1, 1))
y_test_encoded = encoder.transform(y_test.reshape(-1, 1))
```
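We can also inspect what the encoder learned and confirm that the encoding is reversible, which comes in handy later when converting model predictions back to class labels:

```python
import numpy as np

# The encoder learned the three integer categories from the training labels
print(encoder.categories_)  # [array([0, 1, 2])]

# Spot-check one label, then recover all integer labels from the one-hot vectors
print(y_train[0], '->', y_train_encoded[0])
print(np.array_equal(np.argmax(y_train_encoded, axis=1), y_train))  # True
```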
Below, the preprocessing steps covered so far (loading, splitting, scaling, and encoding) are encapsulated in a single function. This modular design allows us to import the preprocessed data in another file where we develop our model.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder

def load_preprocessed_data():
    # Load the Iris dataset
    iris = load_iris()
    X, y = iris.data, iris.target

    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42
    )

    # Scale the features
    scaler = StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # One-hot encode the targets
    encoder = OneHotEncoder(sparse_output=False).fit(y_train.reshape(-1, 1))
    y_train_encoded = encoder.transform(y_train.reshape(-1, 1))
    y_test_encoded = encoder.transform(y_test.reshape(-1, 1))

    return X_train_scaled, X_test_scaled, y_train_encoded, y_test_encoded
```
After defining the function that preprocesses the data, we can load the preprocessed data and print a sample of the training input and target.
```python
# Load preprocessed data
X_train, X_test, y_train, y_test = load_preprocessed_data()

# Print a sample of one training input and target
print(f'Sample of preprocessed X_train: {X_train[0]}')
print(f'Sample of preprocessed y_train: {y_train[0]}\n')

# Print the shape of scaled and encoded data
print(f'Shape of preprocessed X_train: {X_train.shape}')
print(f'Shape of preprocessed X_test: {X_test.shape}')
print(f'Shape of preprocessed y_train: {y_train.shape}')
print(f'Shape of preprocessed y_test: {y_test.shape}')
```
The output of the above code will be:
```text
Sample of preprocessed X_train: [-0.90045861 -1.22024754 -0.4419858 -0.13661044]
Sample of preprocessed y_train: [0. 1. 0.]

Shape of preprocessed X_train: (105, 4)
Shape of preprocessed X_test: (45, 4)
Shape of preprocessed y_train: (105, 3)
Shape of preprocessed y_test: (45, 3)
```
This output illustrates the results of our preprocessing steps: standardized feature data and one-hot encoded target variables, ready to be fed to a machine learning model.
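As a glimpse of where this is headed, here is a minimal sketch (not part of this lesson's required code) showing how these shapes line up with a TensorFlow model: a 4-dimensional input for the scaled features and a 3-unit softmax output for the one-hot encoded targets:

```python
import tensorflow as tf

# Minimal sketch: the preprocessed arrays plug straight into a Keras model.
# The input dimension (4) matches X_train.shape[1], and the output
# dimension (3) matches the one-hot encoded y_train.shape[1].
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(10, activation='relu'),  # 10 units is an arbitrary illustrative choice
    tf.keras.layers.Dense(3, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```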
In conclusion, we have successfully preprocessed the Iris dataset, making it ready for machine learning modeling with TensorFlow. We've loaded, split, scaled, and encoded the data using Python. This foundational knowledge is essential for you as a Machine Learning Engineer, helping you improve accuracy and build efficient models using TensorFlow.
Next, we will have exercises to consolidate these preprocessing steps. The exercises aim to enhance your understanding and application of data preprocessing and prepare you for more challenging tasks in the future. Happy learning!