Building a complete data pipeline involves starting from raw data (which is synthesized for simplicity) and ending with batches ready to be fed into a model. This process requires creating a custom Dataset, defining data transformations, and wrapping everything in a DataLoader.Setting Up a Synthetic DatasetImagine we have a dataset consisting of feature vectors and corresponding binary classification labels (0 or 1). For this exercise, we'll generate this data directly using PyTorch tensors. This avoids file I/O complexities and lets us focus purely on the data handling mechanism.import torch import torch.utils.data as data from torchvision import transforms # Generate synthetic data num_samples = 100 num_features = 10 # Create random feature vectors (e.g., sensor readings) features = torch.randn(num_samples, num_features) # Create random binary labels (0 or 1) labels = torch.randint(0, 2, (num_samples,)) print(f"Shape of features: {features.shape}") # Output: torch.Size([100, 10]) print(f"Shape of labels: {labels.shape}") # Output: torch.Size([100]) print(f"First 5 features:\n{features[:5]}") print(f"First 5 labels:\n{labels[:5]}")This gives us two tensors: features containing 100 samples, each with 10 features, and labels containing the corresponding 100 labels.Creating a Custom DatasetNow, we need to structure this data using PyTorch's Dataset class. We'll create a custom class that inherits from torch.utils.data.Dataset and implements two essential methods:__len__(self): Returns the total number of samples in the dataset.__getitem__(self, idx): Returns the sample (features and label) at the given index idx.We'll also add an __init__ method to store our data and optionally accept transformations.class SyntheticDataset(data.Dataset): """A custom Dataset for our synthetic features and labels.""" def __init__(self, features, labels, transform=None): """ Args: features (Tensor): Tensor containing the feature data. labels (Tensor): Tensor containing the labels. transform (callable, optional): Optional transform to be applied on a sample. """ # Basic check to ensure features and labels have the same number of samples assert features.shape[0] == labels.shape[0], \ "Features and labels must have the same number of samples" self.features = features self.labels = labels self.transform = transform def __len__(self): """Returns the total number of samples.""" return self.features.shape[0] def __getitem__(self, idx): """ Retrieves the feature vector and label for a given index. Args: idx (int): Index of the sample to retrieve. Returns: tuple: (feature, label) where feature is the feature vector and label is the corresponding label. """ # Get the raw feature and label feature_sample = self.features[idx] label_sample = self.labels[idx] # Create a sample dictionary (or tuple) sample = {'feature': feature_sample, 'label': label_sample} # Apply transformations if they exist if self.transform: sample = self.transform(sample) # Return the potentially transformed sample # Common practice is to return features and labels separately return sample['feature'], sample['label'] # Instantiate the dataset without transforms for now raw_dataset = SyntheticDataset(features, labels) # Test retrieving a sample sample_idx = 0 feature_sample, label_sample = raw_dataset[sample_idx] print(f"\nSample {sample_idx} - Feature: {feature_sample}") print(f"Sample {sample_idx} - Label: {label_sample}") print(f"Dataset length: {len(raw_dataset)}") # Output: 100At this point, raw_dataset holds our data and knows how to provide individual samples.Defining Data TransformationsOften, raw data isn't suitable for direct input into a neural network. We might need to normalize features, convert data types, or apply augmentations (especially for images). torchvision.transforms provides convenient tools for this. Even though our data isn't images, we can define custom transformations or use existing ones that operate on tensors.Let's define a simple transformation pipeline:Convert the features tensor to torch.float32 (good practice for model inputs).Convert the label tensor to torch.long (often required by loss functions like CrossEntropyLoss).Apply normalization to the features (subtract mean, divide by standard deviation). We'll calculate these stats from our synthetic dataset for this example.Since torchvision.transforms are primarily designed for images (PIL Image or Tensor), applying them directly to a dictionary like our sample requires a bit of wrapping. We'll create custom callable classes or lambda functions for this.# Calculate mean and std deviation for normalization (across the dataset) feature_mean = features.mean(dim=0) feature_std = features.std(dim=0) # Avoid division by zero if std dev is zero for any feature feature_std[feature_std == 0] = 1.0 # Define custom transform classes/functions for our dictionary sample format class ToTensorAndType(object): """Converts features to FloatTensor and labels to LongTensor.""" def __call__(self, sample): feature, label = sample['feature'], sample['label'] return {'feature': feature.float(), 'label': label.long()} class NormalizeFeatures(object): """Normalizes the feature tensor.""" def __init__(self, mean, std): self.mean = mean self.std = std def __call__(self, sample): feature, label = sample['feature'], sample['label'] # Apply normalization: (tensor - mean) / std normalized_feature = (feature - self.mean) / self.std return {'feature': normalized_feature, 'label': label} # Compose the transformations data_transforms = transforms.Compose([ ToTensorAndType(), NormalizeFeatures(mean=feature_mean, std=feature_std) ]) # Instantiate the dataset WITH the transformations transformed_dataset = SyntheticDataset(features, labels, transform=data_transforms) # Test retrieving a transformed sample sample_idx = 0 transformed_feature, transformed_label = transformed_dataset[sample_idx] print(f"\n--- Transformed Sample {sample_idx} ---") print(f"Original Feature:\n{features[sample_idx]}") print(f"Transformed Feature:\n{transformed_feature}") print(f"Original Label: {labels[sample_idx]} (dtype={labels.dtype})") print(f"Transformed Label: {transformed_label} (dtype={transformed_label.dtype})") # Verify normalization (mean should be close to 0, std close to 1 for the first sample's features) print(f"Transformed Feature Mean: {transformed_feature.mean():.4f}") # Should be near 0 after normalization applied across datasetNotice how the feature values have changed due to normalization, and the data types for both feature and label are now torch.float32 and torch.int64 (LongTensor) respectively.Using the DataLoaderThe final step is to use DataLoader. It takes our Dataset instance and handles batching, shuffling, and potentially parallel data loading.# Create the DataLoader batch_size = 16 # Process data in batches of 16 samples shuffle_data = True # Shuffle the data at the beginning of each epoch num_workers = 0 # Number of subprocesses to use for data loading. 0 means data loading happens in the main process. # On platforms other than Windows, you can often set num_workers > 0 for parallel loading # import os # if os.name != 'nt': # Check if not Windows # num_workers = 2 data_loader = data.DataLoader( transformed_dataset, batch_size=batch_size, shuffle=shuffle_data, num_workers=num_workers ) # Iterate through the DataLoader to get batches print(f"\n--- Iterating through DataLoader (batch_size={batch_size}) ---") # Get one batch feature_batch, label_batch = next(iter(data_loader)) print(f"Type of feature_batch: {type(feature_batch)}") print(f"Shape of feature_batch: {feature_batch.shape}") # Output: torch.Size([16, 10]) print(f"Shape of label_batch: {label_batch.shape}") # Output: torch.Size([16]) print(f"Data type of feature_batch: {feature_batch.dtype}") # Output: torch.float32 print(f"Data type of label_batch: {label_batch.dtype}") # Output: torch.int64 # You can loop through all batches like this (e.g., in a training epoch) # print("\nLooping through a few batches:") # for i, (batch_features, batch_labels) in enumerate(data_loader): # if i >= 3: # Show first 3 batches # break # print(f"Batch {i+1}: Features shape={batch_features.shape}, Labels shape={batch_labels.shape}") # # In a real training loop, you would feed batch_features to your model hereThe DataLoader yields batches where the first dimension corresponds to the batch_size. Our features batch has shape [16, 10], and the labels batch has shape [16]. The data types reflect the transformations we applied.Data Pipeline VisualizationWe can visualize the flow we just created:digraph G { rankdir=LR; node [shape=box, style=filled, fillcolor="#e9ecef", fontname="sans-serif"]; edge [fontname="sans-serif"]; RawData [label="Raw Data\n(features, labels tensors)", fillcolor="#a5d8ff"]; CustomDataset [label="Custom Dataset\n(SyntheticDataset)", fillcolor="#96f2d7"]; Transforms [label="Transformations\n(ToTensor, Normalize)", fillcolor="#ffec99"]; DataLoader [label="DataLoader\n(Batching, Shuffling)", fillcolor="#fcc2d7"]; Model [label="Model Input\n(Batched & Processed Data)", fillcolor="#bac8ff"]; RawData -> CustomDataset [label=" __init__ "]; CustomDataset -> Transforms [label=" __getitem__ applies "]; Transforms -> CustomDataset [style=dashed]; // Transforms are configured in Dataset CustomDataset -> DataLoader [label=" input dataset "]; DataLoader -> Model [label=" iterate for batches "]; }This diagram shows the progression from raw tensors to a custom Dataset, applying transformations during data retrieval (__getitem__), and finally using a DataLoader to produce shuffled batches suitable for model training.You have now successfully built a data pipeline using PyTorch's core data utilities. You created a Dataset to wrap your data, applied necessary transforms, and used a DataLoader to efficiently generate batches. This structured approach is fundamental for handling data in almost any PyTorch project, ensuring your models receive data in the correct format and facilitating efficient training. This pipeline is now ready to be integrated into the training loop we will construct in the next chapter.