Hello and welcome back! Today we focus on defining datasets using PyTorch tensors. These datasets are called `TensorDataset`s and are a vital feature of the PyTorch library. In this lesson, you will convert an array into a tensor, create a `TensorDataset`, use a `DataLoader` to divide the dataset into batches, and iterate through the batches. Let's dive right into it!
As you might already know, PyTorch's primary unit of data storage is the tensor. But what if you have more than one tensor of data and you need to keep them together? That's when `TensorDataset` comes into play.
A `TensorDataset` is a dataset that wraps multiple tensors. Each sample is a tuple of tensors, where the i-th element of the tuple comes from the i-th wrapped tensor. In simpler terms, it is a way to keep your input and output tensors organized together. Using a `TensorDataset` makes it very easy to manage tensors of varying data types as a single dataset.
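To make the "tuple per sample" idea concrete, here is a minimal sketch; the `features`, `labels`, and `pairs` names below are illustrative placeholders, not part of this lesson's running example:

```python
import torch
from torch.utils.data import TensorDataset

# Two illustrative tensors with the same first dimension (2 samples)
features = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
labels = torch.tensor([0, 1])
pairs = TensorDataset(features, labels)

# Indexing returns a tuple with one entry per wrapped tensor
print(pairs[0])  # (tensor([1., 2.]), tensor(0))
```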
While it's not always necessary to use a `TensorDataset`, it can be very convenient, especially if you want to use a `DataLoader` for batching and shuffling your data. The major advantage is that with a `TensorDataset`, PyTorch can efficiently store and access the data, which is crucial when working with large datasets.
Now, let's take the first step of our journey: defining an array and converting it into a tensor. We'll start with a simple array of input data and the target outputs for our dataset.
```python
import numpy as np

# Define a simple array as input data
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
# Define the target outputs for our dataset
y = np.array([0, 1, 0, 1])
```
So far, we have only defined these as NumPy arrays. We now have to convert them into PyTorch tensors.
The conversion code is straightforward: the `torch.tensor` function transforms our NumPy arrays into tensors, and with the `dtype` parameter we can specify their types as floating-point and integer numbers.
```python
import torch

# Convert X and y into PyTorch tensors
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.int32)
```
Now we have successfully converted our arrays into PyTorch tensors!
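If you want to double-check the conversion, a quick sanity check like the following (an optional addition, not part of the original snippet) confirms the shapes and dtypes:

```python
# Optional sanity check: confirm shapes and dtypes after conversion
print(X_tensor.shape, X_tensor.dtype)  # torch.Size([4, 2]) torch.float32
print(y_tensor.shape, y_tensor.dtype)  # torch.Size([4]) torch.int32
```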
We now have our tensors, and we can build a `TensorDataset`. Let's see how we can achieve that.
```python
from torch.utils.data import TensorDataset

# Create a tensor dataset
dataset = TensorDataset(X_tensor, y_tensor)
```
As you can see, the inputs to `TensorDataset` are the tensors we created above. `TensorDataset` wraps these tensors together into a single dataset.
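One detail worth knowing: all wrapped tensors must share the same first dimension (the number of samples), and `len(dataset)` reports that count. A quick illustrative check:

```python
# All wrapped tensors must share the same first dimension (the sample count);
# len(dataset) returns that number of samples
print(len(dataset))  # 4
```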
Let's print the contents of our `TensorDataset` to confirm it was created properly.
```python
# Print X and y of the TensorDataset
for i in range(len(dataset)):
    X_sample, y_sample = dataset[i]
    print(f"X[{i}]: {X_sample}, y[{i}]: {y_sample}")
```
The output will be something like:
```
X[0]: tensor([1., 2.]), y[0]: 0
X[1]: tensor([2., 1.]), y[1]: 1
X[2]: tensor([3., 4.]), y[2]: 0
X[3]: tensor([4., 3.]), y[3]: 1
```
This confirms that our individual tensors have been correctly wrapped into the `TensorDataset`.
To help manage large datasets effectively and make iterating over batches of data easier, PyTorch provides a tool named `DataLoader`. It allows efficient access to data and can really speed up your model training process.
`DataLoader` takes in a dataset along with other parameters such as `batch_size`, which defines the number of samples to work with per batch, and `shuffle`, which, when set to `True`, reshuffles the data at every epoch.
Using a `TensorDataset` with a `DataLoader` is highly convenient, as it allows seamless handling of inputs and targets together in batches.
```python
from torch.utils.data import DataLoader

# Create a data loader
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
```
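As a side note, here is an illustrative variant, not part of the lesson's main flow: with `shuffle=False` the batches keep their original order, and if the dataset size doesn't divide evenly by `batch_size`, the final batch is simply smaller unless you pass `drop_last=True`.

```python
# Illustrative variant: ordered batches and an uneven final batch
ordered_loader = DataLoader(dataset, batch_size=3, shuffle=False)
for batch_X, batch_y in ordered_loader:
    print(batch_X.shape)  # torch.Size([3, 2]), then torch.Size([1, 2])
```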
Finally, we can use the `DataLoader` to iterate through our dataset in batches. This process is fundamental in training machine learning models: it helps the model generalize better and lets us work with larger datasets by fitting only one batch of data in memory at a time.
Let's now print our batches of data.
```python
# Iterate through the dataloader
for batch_X, batch_y in dataloader:
    print(f"Batch X:\n{batch_X}")
    print(f"Batch y:\n{batch_y}\n")
```
The output will be something like:
```
Batch X:
tensor([[1., 2.],
        [2., 1.]])
Batch y:
tensor([0, 1], dtype=torch.int32)

Batch X:
tensor([[4., 3.],
        [3., 4.]])
Batch y:
tensor([1, 0], dtype=torch.int32)
```
This output illustrates how `DataLoader` lets us shuffle and batch our data efficiently. Because of the shuffling, the batches and their order may vary each time the code is executed, which is beneficial for model generalization during training.
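To make the connection to training concrete, here is a minimal sketch of how these batches typically feed a training loop; the model, optimizer, and loss function below are illustrative assumptions, not part of this lesson:

```python
import torch
import torch.nn as nn

# Illustrative-only model, optimizer, and loss for our 2-feature, 2-class data
model = nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    for batch_X, batch_y in dataloader:
        optimizer.zero_grad()
        logits = model(batch_X)
        # CrossEntropyLoss expects int64 class targets, so cast the int32 labels
        loss = loss_fn(logits, batch_y.long())
        loss.backward()
        optimizer.step()
```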
That's a wrap! You should now have a good understanding of defining PyTorch tensors and of the convenience of using a `TensorDataset`, especially when paired with a `DataLoader`. We also looked at iterating over the `DataLoader` in batches.
Now it's your turn to strengthen your newly acquired skills through practice. The upcoming exercises will give you an opportunity to apply today's lesson. Situations like these arise often in a machine learning engineer's daily work, so proficiency in these skills is essential.
Keep practicing and happy learning!