Lesson 4

Hello and welcome back! Today we focus on defining datasets using **PyTorch** Tensors. These datasets are called *TensorDatasets* and are a very vital feature of the PyTorch library. In this lesson, you will convert an array into a tensor, create a *TensorDataset*, use *DataLoader* for dividing the dataset into batches, and iterate through the batches. Let's dive right into it!

As you might already know, PyTorch's primary unit of data storage is a tensor. But what if you have more than one tensor of data and you need to keep it collected? That's when `TensorDataset`

comes into play.

A `TensorDataset`

is a dataset that wraps multiple tensors. Each sample is a tuple of tensors where each tensor in the tuple corresponds to a level of the dataset. In simpler terms, it is a way to keep your tensors of input and output data organized together. Using `TensorDataset`

makes it very easy to provide and manage your tensors of varying data types.

While it’s not always necessary to use `TensorDataset`

, it can be very convenient, especially if you want to use a `DataLoader`

for batching and shuffling your data. The major advantage here is that using `TensorDataset`

, PyTorch can efficiently store and access the data, which is crucial while working with large datasets.

Now, let's get our hands on the first step of our journey - defining an array and converting it into a tensor. For our lesson, we'll start with a simple array of input data and the target outputs for our dataset.

Python`1import numpy as np 2 3# Define a simple array as input data 4X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]) 5# Define the target outputs for our dataset 6y = np.array([0, 1, 0, 1])`

Up till here, we have just defined them as numpy arrays. We now have to convert them into PyTorch tensors.

The conversion code is straightforward, the `torch.tensor`

function helps us transform our numpy array into tensors, and, with the use of the `dtype`

parameter, we can specify them as floating point and integer numbers.

Python`1import torch 2 3# Convert X and y into PyTorch tensors 4X_tensor = torch.tensor(X, dtype=torch.float32) 5y_tensor = torch.tensor(y, dtype=torch.int32)`

Now we have successfully converted our arrays into PyTorch tensors!

We now have our tensors and we can build a `TensorDataset`

. Let's see how we can achieve that.

Python`1from torch.utils.data import TensorDataset 2 3# Create a tensor dataset 4dataset = TensorDataset(X_tensor, y_tensor)`

As you can see, the input to `TensorDataset`

is the tensors we created above. `TensorDataset`

will bundle or rather, wrap these tensors together into a single dataset.

Let's print the contents of our `TensorDataset`

to confirm it's properly created.

Python`1# Print x and y of the TensorDataset 2for i in range(len(dataset)): 3 X_sample, y_sample = dataset[i] 4 print(f"X[{i}]: {X_sample}, y[{i}]: {y_sample}")`

The output will be something like:

`1X[0]: tensor([1., 2.]), y[0]: 0 2X[1]: tensor([2., 1.]), y[1]: 1 3X[2]: tensor([3., 4.]), y[2]: 0 4X[3]: tensor([4., 3.]), y[3]: 1`

This confirms that our individual tensors have been correctly wrapped into the `TensorDataset`

.

Helping in the effective management of large datasets and easier iterating over data batches, PyTorch provides a tool named `DataLoader`

. It allows efficient access to data and can really speed up your model training process.

`DataLoader`

takes in a dataset and other parameters like `batch_size`

, which defines the number of samples to work with per batch, and `shuffle`

, which indicates to shuffle the data every epoch when set to True.

Using a `TensorDataset`

with `DataLoader`

is highly convenient as it allows for seamless handling of inputs and targets together in batches.

Python`1from torch.utils.data import DataLoader 2 3# Create a data loader 4dataloader = DataLoader(dataset, batch_size=2, shuffle=True)`

Finally, we are now able to use the `DataLoader`

and iterate through our dataset in batches. This process is fundamental in training Machine Learning models, as it allows the model to generalize better and also enables us to work with larger datasets by fitting only a batch of data in the memory at a time.

Let's now print our batches of data.

Python`1# Iterate through the dataloader 2for batch_X, batch_y in dataloader: 3 print(f"Batch X:\n{batch_X}") 4 print(f"Batch y:\n{batch_y}\n")`

The output will be something like:

`1Batch X: 2tensor([[1., 2.], 3 [2., 1.]]) 4Batch y: 5tensor([0, 1], dtype=torch.int32) 6 7Batch X: 8tensor([[4., 3.], 9 [3., 4.]]) 10Batch y: 11tensor([1, 0], dtype=torch.int32)`

This output illustrates how `DataLoader`

allows us to shuffle and batch our data efficiently. Due to the shuffling, the presented batches and their order might vary each time the code is executed. This is beneficial for model generalization during training.

That's a wrap! You should now have a good understanding of defining PyTorch tensors and the convenience of using `TensorDataset`

especially when paired with `DataLoader`

. We also looked at iterating the `DataLoader`

.

Now it's your turn to strengthen your newly acquired skills through practice. The upcoming exercises will give you an opportunity to apply today's lesson. These situations with datasets commonly arise in a Machine Learning Engineer's daily work, hence proficiency in these skills is of utmost importance.

Keep practicing and happy learning!