Preprocessing the Wine Dataset for PyTorch

Lesson 1

Topic Overview

Welcome to the lesson on Preprocessing the Wine Dataset for PyTorch! In this session, we will focus on preparing the Wine dataset for use in machine learning models with PyTorch. By the end of this lesson, you will have a thorough understanding of how to load datasets, split them, scale the features, and convert them into PyTorch tensors.

Why is preprocessing necessary?

Loading Data: We need to load the dataset into our environment to work with it.
Splitting Data: To evaluate the performance of our model, it's necessary to split the data into training and test sets.
Scaling Features: Many machine learning algorithms perform better when features are on a similar scale.
Converting to Tensors: PyTorch models work with tensors, so we need to convert our data into the appropriate format.

Loading and Exploring the Wine Dataset

The Wine dataset is a classic dataset in machine learning, often used for classification tasks. It consists of 178 instances of wine where each instance is described by 13 numeric features (such as alcohol content, malic acid, etc.). The target variable is a class label indicating the type of wine (three classes), which will be predicted using the 13 features.

To start our preprocessing steps, we need to load the Wine dataset and explore its structure. We'll use sklearn.datasets to load the dataset.

Python
1import torch
2from sklearn.datasets import load_wine
3
4# Load the Wine dataset
5wine = load_wine()
6
7# Explore dataset features and target classes
8print("Features:", wine.feature_names)
9print("Target classes:", wine.target_names)

The output will give you an idea of what the dataset looks like and the features it contains.

Plain text
1Features: [
2    'alcohol',
3    'malic_acid', 
4    'ash', 
5    'alcalinity_of_ash', 
6    'magnesium', 
7    'total_phenols', 
8    'flavanoids', 
9    'nonflavanoid_phenols', 
10    'proanthocyanins', 
11    'color_intensity', 
12    'hue', 
13    'od280/od315_of_diluted_wines', 
14    'proline']
15    
16Target classes: ['class_0' 'class_1' 'class_2']

Splitting the Dataset

Splitting the dataset into training and testing sets is crucial for evaluating the performance of machine learning models. We use the training set to train our model, ensuring it learns patterns within the data. The testing set, which the model hasn't seen during training, is used to evaluate the model's performance. This helps us understand how well the model generalizes to unseen data.

We'll use train_test_split from sklearn.model_selection for this task. It's important to use stratify for y, because it ensures that class proportions are preserved in both training and testing sets. The test_size=0.3 parameter means that 30% of the data will be used for testing, while the remaining 70% will be used for training.

Python
1from sklearn.model_selection import train_test_split
2
3X, y = wine.data, wine.target
4
5# Split the dataset into training and testing sets
6X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
7
8# Display the shapes of the resulting splits
9print("Shape of X_train:", X_train.shape)
10print("Shape of X_test:", X_test.shape)
11print("Shape of y_train:", y_train.shape)
12print("Shape of y_test:", y_test.shape)

The output of the above code will be:

Plain text
1Shape of X_train: (124, 13)
2Shape of X_test: (54, 13)
3Shape of y_train: (124,)
4Shape of y_test: (54,)

This output shows the dimensions of our training and test sets. It confirms that the dataset has been split accordingly, with 124 samples in the training set and 54 samples in the test set, while keeping the feature count constant at 13 for both sets.

Feature Scaling

Feature scaling is critical for many machine learning algorithms. It ensures that features are on similar scales, which can improve the performance of algorithms. For that we'll use StandardScaler from sklearn.preprocessing to transform the features so they all have a mean of 0 and a standard deviation of 1.

Python
1from sklearn.preprocessing import StandardScaler
2
3# Initialize the scaler and fit it to the training data
4scaler = StandardScaler().fit(X_train)
5
6# Transform both the training and testing datasets
7X_train_scaled = scaler.transform(X_train)
8X_test_scaled = scaler.transform(X_test)
9
10# Display scaled and unscaled samples
11print("Unscaled X sample:\n", X_train[0])
12print("Scaled X sample:\n", X_train_scaled[0])

While initializing our scaler, we can use the fit method to calculate the mean and standard deviation for each feature in the training set, which are used to transform the features. As shown in the ouput, all features should have similar scales after scaling, making them more suitable for machine learning models.

Plain text
1Unscaled X sample:
2 [1.358e+01 2.580e+00 2.690e+00 2.450e+01 1.050e+02 1.550e+00 8.400e-01
3 3.900e-01 1.540e+00 8.660e+00 7.400e-01 1.800e+00 7.500e+02]
4Scaled X sample:
5 [ 0.74011523  0.13609281  1.15758212  1.44877623  0.43868637 -1.22617445
6 -1.17874888  0.20205125 -0.06764496  1.55049978 -0.9240069  -1.15909415
7  0.03351788]

Converting Data to PyTorch Tensors

To use the data with PyTorch, we need to convert the NumPy arrays into PyTorch tensors. This conversion is essential as PyTorch models require input data in tensor format.

Python
1# Convert scaled data to PyTorch tensors
2X_train_tensor = torch.tensor(X_train_scaled, dtype=torch.float32)
3X_test_tensor = torch.tensor(X_test_scaled, dtype=torch.float32)
4y_train_tensor = torch.tensor(y_train, dtype=torch.long)
5y_test_tensor = torch.tensor(y_test, dtype=torch.long)
6
7# Display example tensors
8print("Sample of X_train_tensor:", X_train_tensor[0])
9print("Sample of y_train_tensor:", y_train_tensor[0])

We use dtype=torch.float32 for features to ensure numeric precision during computations, and dtype=torch.long for labels because classification tasks in PyTorch expect labels to be in integer format. Specifically, torch.long represents 64-bit integer data type, which is required by loss functions such as nn.CrossEntropyLoss in PyTorch. This ensures that our labels are treated as categorical values and not as continuous numerical values, which is essential for accurate model training and evaluation.

The output of the above code will be:

Plain text
1Sample of X_train_tensor: tensor([ 0.3371, -0.2918,  1.1658, -0.8438,  0.2092,  1.0781,  1.1709, -0.4048,
2         0.0715,  0.3571,  0.5270,  0.2754,  1.5964])
3Sample of y_train_tensor: tensor(0)

This output is a numerical representation of the scaled features of the first wine sample in the training set and its corresponding class label, where 0 means the wine belongs to class 0. It demonstrates how data is transformed into tensor format, ready for use in PyTorch models.

Summary

In summary, we covered the structure and significance of the Wine dataset, loaded and explored its features and target classes, split the data into training and testing sets using stratified sampling, scaled the features to have a mean of 0 and a standard deviation of 1, and converted the scaled data into PyTorch tensors for use in PyTorch models.

Up next, you will have practice exercises to solidify your understanding and ensure you are comfortable with these preprocessing steps. Mastering these basics is crucial for your journey as a machine learning engineer. Keep practicing, and you'll see how these foundational skills translate into better models and more accurate predictions!

Enjoy this lesson? Now it's time to practice with Cosmo!

Practice is how you turn knowledge into actual skills.