Optimizing PyTorch Data Pipelines: From Bottlenecks to 39× Speedups
An examination of how data pipeline design impacts PyTorch training performance, supported by simple experiments and benchmarks.
Introduction#
Today I decided to revisit the basic PyTorch training loop before diving into experimenting with some new computer vision algorithms. The main goal was to check if my previous understanding and habits are still effective, or if I'm missing anything in the current landscape. To be frank, sometimes we use "packaged" frameworks (like PyTorch Lightning, Hugging Face Accelerate) so much that we forget what's really happening underneath.
To perform these experiments, I'm using a personal workstation with a pretty powerful configuration, designed for computation and AI tasks:
- CPU: Intel i9-14900K (24 cores / 32 threads) @ 5.7 GHz
- GPU: NVIDIA RTX A4000 (16GB VRAM)
- RAM: 32 GB
A powerful setup like this, especially the GPU, will make any slowness from the CPU or I/O pipeline (data loading) more obvious. If the GPU has to wait for the CPU to prepare data, we will immediately see "GPU starvation".
Fundamentally, a standard training loop for a supervised learning problem in PyTorch is quite simple. It's illustrated by the code snippet below:
# Example for a supervised learning algorithm with labels
for epoch in range(10):
    for batch_idx, (x, y) in enumerate(train_loader):
        # Each loop iteration processes one mini-batch:
        #   x: input tensor [batch_size, ...]
        #   y: labels for the mini-batch
        #   batch_idx: index of the mini-batch

        # 1. Reset gradients from the previous iteration
        optimizer.zero_grad()
        # 2. Forward pass: pass data through the model
        predictions = model(x)
        # 3. Calculate the loss
        loss = loss_function(predictions, y)
        # 4. Backward pass: calculate gradients
        loss.backward()
        # 5. Update the weights
        optimizer.step()
Experimental datasets#
In the experimental section, I use two classic datasets in computer vision: MNIST and FashionMNIST.
- MNIST: The handwritten digit dataset published by Yann LeCun, consisting of 10 classes corresponding to digits 0 through 9.
- FashionMNIST: Proposed as a more challenging drop-in replacement for MNIST, consisting of images of fashion products like t-shirts, trousers, shoes, bags, etc.
For visual illustration, Figure 1 shows several typical image samples from FashionMNIST, while Figure 2 presents the corresponding samples from MNIST. Each image is in 28×28 grayscale format, demonstrating simplicity yet containing sufficient information for basic classification tasks.
[Figure 1: Sample images from FashionMNIST]
[Figure 2: Sample images from MNIST]
Both datasets share common characteristics:
- 60,000 images for training and 10,000 images for testing.
- Grayscale images with dimensions 28×28.
- The goal is to recognize and classify the images into one of the 10 available labels.
Thanks to their compact size and ease of processing, MNIST and FashionMNIST are often considered the "Hello World" of Computer Vision problems. They allow us to focus on optimizing the computation pipeline and data loading mechanism, without being dominated by complex data transformations or large I/O costs from disk.
An interesting discovery: CSV vs. Parquet#
At this point, I noticed the .csv files I was storing (each row is 784 pixels + 1 label) were quite large. I suddenly remembered reading somewhere about Parquet and decided to try storing the dataset in this format.
Oh, and the results were really interesting: significantly smaller file sizes and much faster read/write speeds.
| Dataset | train.csv (MB) | test.csv (MB) | train.parquet (MB, size reduction) | test.parquet (MB, size reduction) |
|---|---|---|---|---|
| MNIST | 109.6 | 18.3 | 18.1 (~6.0x) | 3.8 (~4.8x) |
| FashionMNIST | 133 | 22.2 | 37.6 (~3.5x) | 7.3 (~3.0x) |
The reason behind this efficiency:
- Columnar Storage: Parquet stores data by column, while CSV stores it by row. When data is stored by column, values in the same column (e.g., pixel_10, pixel_11, ...) share the same data type and similar properties, which leads to much better compression.
- Efficient Compression and Encoding: Parquet supports multiple compression codecs (Snappy, Gzip, Zstandard, ...) and special encoding techniques for each column. Because it's easier to find patterns in same-typed data within a column, compression is significantly more effective.
Downside: Parquet is a binary format. You can't open it with a text editor and read it like a CSV file. But for large data, this is a completely worthwhile trade-off.
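For reference, the conversion itself is a one-liner with pandas. Here's a minimal sketch, assuming the CSV already holds the 784 pixel columns plus the label column (the file paths are just placeholders):

```python
import pandas as pd

# Read the original CSV (784 pixel columns + 1 label column)
df = pd.read_csv("data/FashionMNIST/train.csv")

# Write it back out as Parquet; pandas uses a pyarrow/fastparquet engine
# under the hood, with Snappy compression by default
df.to_parquet("data/FashionMNIST/train.parquet", index=False)

# Reading it back is just as simple
df2 = pd.read_parquet("data/FashionMNIST/train.parquet")
```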
Naive Implementation#
Ok, now let's get into the implementation. We need a custom Dataset class to read the Parquet file and a simple CNN model. I fixed random_seed = 42 to ensure consistent results.
Dataset Class (Version 1)#
This is the first "naive" implementation. The idea is to load the entire Parquet file into RAM in __init__, then the __getitem__ function will be responsible for retrieving a row (.iloc), processing it, and transforming it into a Tensor.
class MNISTDataset(Dataset):
    def __init__(self, parquet_path: str, num_classes: int = 10):
        # Load the entire file into RAM
        self.df = pd.read_parquet(parquet_path)
        self.label_col = 'label' if 'label' in self.df.columns else None
        self.feature_cols = [c for c in self.df.columns if c != self.label_col]
        self.num_classes = num_classes

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # 1. Get the row for this index
        row = self.df.iloc[idx]

        # 2. Process the features
        pixels = row[self.feature_cols].to_numpy(dtype=np.float32)
        image = torch.tensor(pixels.reshape(1, 28, 28)) / 255.0

        # 3. Process the label
        if self.label_col:
            # Return the label as an integer (class index)
            label = torch.tensor(int(row[self.label_col]), dtype=torch.long)
        else:
            label = torch.tensor(-1, dtype=torch.long)  # Test case (no label)

        return {
            "image": image,  # Tensor [1, 28, 28]
            "label": label,  # Scalar tensor
        }
Tiny Neural Network#
This model is just a basic CNN to give the GPU something to compute.
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, 3, padding=1)   # [B, 1, 28, 28] → [B, 16, 28, 28]
        self.pool = nn.MaxPool2d(2, 2)                # [B, 16, 28, 28] → [B, 16, 14, 14]
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)  # [B, 16, 14, 14] → [B, 32, 14, 14]
        self.fc1 = nn.Linear(32 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)  # flatten
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
This model has 206,922 parameters (a tiny network).
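If you want to verify that number yourself, a quick sanity check (not part of the training code) is:

```python
model = SimpleCNN()
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params:,}")  # -> 206,922 for this architecture
```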
Hyperparameters#
I'll try the FashionMNIST dataset first, with the following config:
train_dataset = MNISTDataset("data/FashionMNIST/train.parquet")
test_dataset = MNISTDataset("data/FashionMNIST/test.parquet")

train_dataloader = DataLoader(
    train_dataset,
    batch_size=16,
    shuffle=True,
)
test_dataloader = DataLoader(
    test_dataset,
    batch_size=16,
    shuffle=False,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
I'll start by training for 5 epochs.
Note: I should have split the train set into train and val sets and evaluated on the val set, not the test set. But for now, I'm not worried about it. Also, calling model.train() and model.eval() is very important (especially when using Dropout or BatchNorm), but with this simple model, I'll skip it for now.
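For completeness, here's roughly what that would look like if I did bother. This is just a sketch of the usual pattern, not the code I actually ran:

```python
for epoch in range(5):
    model.train()               # enable Dropout/BatchNorm training behavior
    for batch in train_dataloader:
        ...                     # forward / backward / step as in the loop above

    model.eval()                # switch those layers to inference behavior
    with torch.no_grad():       # no gradients needed for evaluation
        for batch in test_dataloader:
            ...                 # forward pass + metrics only
```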
Ok, and the log for batch_size=16:
Average Train Time per Epoch: 17.13s
Average Test Time per Epoch: 2.39s
Total Train Time: 85.64s
Total Test Time: 11.95s
Hmm, so one epoch takes about 19 seconds on average (17s train + 2s test).
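The numbers above come from simple wall-clock timing around the train and test loops. A minimal sketch of that kind of harness (my actual script differs in the details) looks like this:

```python
import time

train_times, test_times = [], []
for epoch in range(5):
    start = time.perf_counter()
    # ... training loop over train_dataloader ...
    # (for precise GPU timing, call torch.cuda.synchronize() before reading the clock)
    train_times.append(time.perf_counter() - start)

    start = time.perf_counter()
    # ... evaluation loop over test_dataloader ...
    test_times.append(time.perf_counter() - start)

print(f"Average Train Time per Epoch: {sum(train_times) / len(train_times):.2f}s")
print(f"Average Test Time per Epoch: {sum(test_times) / len(test_times):.2f}s")
print(f"Total Train Time: {sum(train_times):.2f}s")
print(f"Total Test Time: {sum(test_times):.2f}s")
```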
Analyzing the Pipeline Bottleneck#
A very natural thought is: increase the batch size to speed up computation - as I often do. My RTX A4000 GPU is idling with batch_size=16. I'll try increasing the batch_size to 32, 64, 128, 256, 512.
At this point, something strange happened. The speed did increase a bit (to about 16 seconds/epoch on average), but from batch_size=32 all the way up to batch_size=512, the time did not change significantly.
Here is the log for batch_size=512:
Average Train Time per Epoch: 13.67s
Average Test Time per Epoch: 2.17s
Total Train Time: 68.36s
Total Test Time: 10.87s
It's only slightly faster, which is negligible. What on earth is going on? I thought to myself.
Then I realized the problem almost immediately, and I have my CUDA studies to thank for reacting faster than I would have in the past: I guessed right away that the problem wasn't in the compute (GPU) but in the data pipeline.
More specifically: the data isn't loading fast enough to keep up with the GPU's computation speed. A classic bottleneck. The GPU utilization was very low at this point; it was spending most of its time waiting for the CPU to deliver the next batch.
Ok, I started to carefully examine the code above. I realized there were two potential areas for improvement:
- The attributes of the DataLoader.
- A serious bottleneck inside the __getitem__ function of the Dataset class.
Note: My previously implemented Dataset class loads the entire data (Parquet file) into RAM (stored in self.df). In reality, if the data is large (hundreds of GB), this is impossible. In that case, you'd have to split the file or stream the data. But in this case, the data is small enough to fit in RAM.
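Before touching anything, it's worth confirming where the time actually goes. A rough, hypothetical way to split each batch into "time spent waiting for data" versus "time spent computing" (using the names already defined above) is:

```python
import time

fetch_time, compute_time = 0.0, 0.0
t0 = time.perf_counter()
for batch in train_dataloader:
    t1 = time.perf_counter()
    fetch_time += t1 - t0                 # time spent waiting for the DataLoader

    x = batch["image"].to(device)
    y = batch["label"].to(device)
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    if device.type == "cuda":
        torch.cuda.synchronize()          # wait for the GPU so the timing is honest
    t0 = time.perf_counter()
    compute_time += t0 - t1               # transfer + forward/backward/step

print(f"data loading: {fetch_time:.2f}s, compute: {compute_time:.2f}s")
```

If the loading time dominates, the GPU is starving and the pipeline is the thing to fix.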
Optimizing DataLoader to Increase Throughput#
I started experimenting with the DataLoader first, keeping batch_size=512.
num_workers#
num_workers is the number of subprocesses the DataLoader spawns to load data in parallel. If not specified (the default is 0), data is loaded in the main process. This is the bottleneck! The main process is busy coordinating the GPU and also has to do all the work of loading data. So, in theory, increasing the number of workers should increase throughput. I'll try running with num_workers=4 (my CPU has 32 hardware threads, but 4 is a reasonable starting point).
train_dataloader = DataLoader(
    train_dataset,
    batch_size=512,
    shuffle=True,
    num_workers=4,
)
test_dataloader = DataLoader(
    test_dataset,
    batch_size=512,
    shuffle=False,
    num_workers=4,
)
Wow! The result was astonishing. The average time per epoch is now only about 4.3 seconds, nearly 4 times faster than before (16 seconds).
Average Train Time per Epoch: 3.68s
Average Test Time per Epoch: 0.67s
Total Train Time: 18.39s
Total Test Time: 3.37s
Ok, now the question is: what happens if we increase the number of workers (e.g., 8, 16)? The Intel i9-14900K has 32 hardware threads, but when I tried increasing the workers, the result was similar to the batch-size experiment: the time didn't improve significantly. It seems num_workers=4 was already enough to saturate the pipeline. I'll fix the config at num_workers=4.
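If you want to find the sweet spot on your own machine, a quick sweep is enough. This is a hypothetical snippet that only iterates the loader (no training) to isolate the loading cost:

```python
import time

for workers in [0, 2, 4, 8, 16]:
    loader = DataLoader(train_dataset, batch_size=512, shuffle=True, num_workers=workers)
    start = time.perf_counter()
    for batch in loader:
        pass  # just pull batches, no GPU work
    print(f"num_workers={workers}: {time.perf_counter() - start:.2f}s to iterate one epoch")
```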
persistent_workers#
In DataLoader, this parameter is False by default. When it's False, the DataLoader operates as follows:
- Start of an epoch: create num_workers worker processes.
- Workers load batches.
- End of the epoch: kill all workers.
- Next epoch: spawn new workers from scratch.
The overhead here is: the time to create new processes, reload the dataset file, reload libraries (numpy/pandas...).
If persistent_workers=True, the workers will be kept alive across epochs. The benefit is mainly seen with a large number of epochs, but I'll still try it with 5 epochs.
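Concretely, this just means adding one flag to the existing configuration (shown here for the train loader only; the test loader gets the same treatment):

```python
train_dataloader = DataLoader(
    train_dataset,
    batch_size=512,
    shuffle=True,
    num_workers=4,
    persistent_workers=True,  # keep worker processes alive between epochs
)
```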
Average Train Time per Epoch: 3.61s
Average Test Time per Epoch: 0.62s
Total Train Time: 18.06s
Total Test Time: 3.09s
Hmm, not a very significant difference.
pin_memory#
Next, let's talk about pin_memory. This attribute is very important, directly related to how PyTorch transfers data from RAM (CPU) to VRAM (GPU) more quickly and efficiently.
To understand what pin_memory=True does, we need to know about two types of CPU memory:
- Pageable Memory: When we create a tensor or any variable, it is stored in pageable memory by default. This is memory that the Operating System (OS) has full control over. If the OS sees that RAM is running low, it can swap that block of data out to the hard drive to free up RAM for other tasks.
- Page-Locked / Pinned Memory: This is a special type of memory in the CPU's RAM. When we "pin" a memory region, we are telling the OS: "Absolutely do not move or swap this data region to the hard drive." It stays locked at a fixed physical address in RAM.
Where is the problem?
Data transfer from CPU to GPU uses DMA (Direct Memory Access) to achieve the highest speed. DMA requires the data to have a fixed physical address in RAM. It cannot work efficiently with Pageable Memory because the OS can move that data at any time, forcing the GPU to wait or perform an intermediate copy step.
When we set pin_memory=True in the DataLoader, it will automatically load the data batches into Pinned Memory (instead of Pageable Memory). Because this memory region is fixed, the GPU's DMA mechanism can access and copy it directly to VRAM, eliminating latency and significantly increasing transfer speed.
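Pinned memory pays off most when combined with asynchronous host-to-device copies, i.e. passing non_blocking=True when moving the batch to the GPU. I'm not doing this in the experiments below, but it's the natural companion to pin_memory=True:

```python
for batch in train_dataloader:
    # With pin_memory=True, these copies can overlap with GPU compute
    x = batch["image"].to(device, non_blocking=True)
    y = batch["label"].to(device, non_blocking=True)
    # ... forward / backward / step as usual ...
```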
Ok, that's the theory. I'll try turning it on (along with num_workers=4 and persistent_workers=True):
train_dataloader = DataLoader(
    train_dataset,
    batch_size=512,
    shuffle=True,
    num_workers=4,
    persistent_workers=True,
    pin_memory=True,
)
test_dataloader = DataLoader(
    test_dataset,
    batch_size=512,
    shuffle=False,
    num_workers=4,
    persistent_workers=True,
    pin_memory=True,
)
And the result:
Average Train Time per Epoch: 3.64s
Average Test Time per Epoch: 0.62s
Total Train Time: 18.19s
Total Test Time: 3.09s
Unfortunately, while the theory is grand, it had almost no impact in this case (the time is nearly identical to when it was off).
The reason might be that our data is too small (28x28 images) and the model is also too small. The cost of pinning the memory is almost equal to the benefit it provides. However, for large models and heavy input data (like high-resolution images), this is a parameter that must be enabled to optimize speed.
Refactor Dataset to Reduce CPU Cost#
Ok, next I'll talk about another important optimization point. After using num_workers, I removed the I/O bottleneck, but now the bottleneck has shifted to CPU processing.
I noticed the code inside __getitem__ of the Dataset class (v1) has a very serious problem:
# Inside __getitem__(self, idx)
row = self.df.iloc[idx]
pixels = row[self.feature_cols].to_numpy(dtype=np.float32)
image = torch.tensor(pixels.reshape(1, 28, 28)) / 255.0
All of these operations (the iloc DataFrame access, to_numpy, torch.tensor, reshape, and the division by 255.0) are performed every single time an item is fetched.
Although self.df is already in RAM (so this isn't I/O bound), this processing is repeated 60,000 times per epoch, spread across the 4 workers. This is a huge waste of CPU resources.
Solution: If we've already decided to load everything into RAM, why not perform all the transformation steps once at initialization time (__init__)?
Then __getitem__ is reduced to the cheapest possible operation: an O(1) lookup of a pre-processed element from a tensor.
Dataset Class (Version 2)#
I re-implemented it as follows:
class MNISTDataset(Dataset):
    def __init__(self, parquet_path: str, num_classes: int = 10):
        df = pd.read_parquet(parquet_path)
        self.label_col = 'label' if 'label' in df.columns else None
        self.feature_cols = [c for c in df.columns if c != self.label_col]
        self.num_classes = num_classes

        # Process all features ONCE
        features_np = df[self.feature_cols].to_numpy(dtype=np.float32)
        # Reshape and convert to a Tensor ONCE
        self.images = torch.from_numpy(features_np.reshape(-1, 1, 28, 28)) / 255.0

        # Process all labels ONCE
        if self.label_col:
            labels_np = df[self.label_col].to_numpy(dtype=np.int64)
            self.labels = torch.from_numpy(labels_np).long()
        else:
            self.labels = torch.zeros(len(df), dtype=torch.long)

    def __len__(self):
        # Return the stored length
        return len(self.images)

    def __getitem__(self, idx):
        # Simply access by index (extremely fast)
        image = self.images[idx]
        label = self.labels[idx]
        return {
            "image": image,
            "label": label,
        }
Result#
Now, let's run again with the Dataset version 2, and keep the optimized DataLoader configuration (bs=512, num_workers=4, pin_memory=True).
Wow, and the result is truly surprising, a reduction of nearly 9 times compared to the previous DataLoader optimization.
Average Train Time per Epoch: 0.44s
Average Test Time per Epoch: 0.06s
Total Train Time: 2.21s
Total Test Time: 0.29s
An epoch now takes only about 0.5 seconds (0.44s train + 0.06s test), an astonishing result.
Let's summarize the entire process:
| Method | Avg Train Time/Epoch | Avg Test Time/Epoch | Total Time (5 epochs) | Speedup (vs. Naive) |
|---|---|---|---|---|
| Naive | 17.13s | 2.39s | 97.60s | 1x |
| num_workers | 3.68s | 0.67s | 21.75s | ~4.5x |
| persistent_workers | 3.61s | 0.62s | 21.15s | ~4.6x |
| pin_memory | 3.64s | 0.62s | 21.30s | ~4.6x |
| __getitem__ refactor | 0.44s | 0.06s | 2.50s | ~39x |
Conclusion#
It's fascinating that we managed to speed up training by about 39 times compared to the initial naive approach. It completely made me forget that I haven't even touched the MNIST dataset yet, but oh well, I'll test both datasets with new vision algorithms in future posts. Hopefully, I'll discover another way to speed things up even more.
Nguyen Xuan Hoa
nguyenxuanhoakhtn@gmail.com