When training large language models, processing the vast amounts of required data on a single accelerator quickly becomes infeasible. Data parallelism offers a fundamental strategy to distribute this workload across multiple processing units, typically GPUs or TPUs, allowing you to train models faster and handle larger effective batch sizes.
The core idea behind data parallelism is simple: replicate the entire model on each available worker (device), feed each worker a different slice (shard) of the input data batch, and then combine the results after each step. Let's break down the typical workflow:
Data parallelism workflow: the global batch is split into shards; each worker computes gradients on its shard using its own model replica. Gradients are synchronized via AllReduce, and each worker then applies an identical update to its model copy.
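To make the synchronization step concrete, here is a minimal, hand-written sketch of a single data-parallel step. It assumes torch.distributed has already been initialized and that each worker already holds its own shard of the batch; the function name is purely illustrative. Frameworks perform these steps for you, and far more efficiently (for example, by overlapping communication with computation).

import torch
import torch.distributed as dist

def manual_data_parallel_step(model, local_batch, local_targets, optimizer, criterion):
    # 1. Each worker computes the loss and local gradients on its own shard.
    optimizer.zero_grad()
    loss = criterion(model(local_batch), local_targets)
    loss.backward()
    # 2. AllReduce: average the gradients across workers so every replica sees the same values.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
    # 3. Every worker applies the identical update, keeping the replicas in sync.
    optimizer.step()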
Frameworks like PyTorch (DistributedDataParallel, or DDP), TensorFlow (tf.distribute.Strategy), and Horovod abstract away much of the complexity of implementing data parallelism. DeepSpeed also builds upon these concepts, adding further optimizations.
A typical structure using PyTorch's DDP might look like this (simplified):
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

# Example hyperparameters (illustrative values; adjust for your setup)
local_batch_size = 32
learning_rate = 1e-4
num_epochs = 3

def setup(rank, world_size):
    # Rendezvous information for single-node training (example values)
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "12355")
    # Initialize the process group (e.g., using the NCCL backend for GPUs)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def train_step(rank, world_size, model, data_loader, optimizer, criterion):
    model.train()
    # Data sharding is handled by the DistributedSampler attached to the DataLoader
    for data, target in data_loader:
        data, target = data.to(rank), target.to(rank)  # Move data to this worker's device
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        # loss.backward() computes local gradients; DDP triggers the AllReduce during
        # the backward pass, so gradients are synchronized across workers here
        loss.backward()
        # optimizer.step() updates local model parameters using the synchronized gradients
        optimizer.step()

def main_worker(rank, world_size, model_definition, dataset):
    setup(rank, world_size)
    model = model_definition().to(rank)
    # Wrap the model with DDP
    ddp_model = DDP(model, device_ids=[rank])
    # Use DistributedSampler so each worker sees a distinct shard of the dataset
    sampler = torch.utils.data.distributed.DistributedSampler(
        dataset, num_replicas=world_size, rank=rank)
    data_loader = torch.utils.data.DataLoader(
        dataset, batch_size=local_batch_size, sampler=sampler)
    optimizer = optim.AdamW(ddp_model.parameters(), lr=learning_rate)
    criterion = nn.CrossEntropyLoss()  # Example criterion
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # Reshuffle shards each epoch
        train_step(rank, world_size, ddp_model, data_loader, optimizer, criterion)
        # Add validation, checkpointing, etc.
    cleanup()

# --- Main execution logic to spawn one process per GPU ---
# if __name__ == "__main__":
#     world_size = torch.cuda.device_count()
#     mp.spawn(main_worker, args=(world_size, model_def, dataset), nprocs=world_size, join=True)
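Here mp.spawn launches one process per GPU, each running main_worker with its own rank. The same structure also works with the torchrun launcher, with small changes: torchrun starts the processes for you and supplies the rank and world size through environment variables, which you would read instead of passing them via mp.spawn.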
Key operational points arise when using data parallelism at scale. A common one is memory: when the desired effective batch size cannot be reached by increasing local_batch_size, because each worker's device memory is limited, gradient accumulation is used. Each worker processes several smaller "micro-batches" sequentially, accumulating gradients locally before performing the AllReduce and optimizer step. This simulates a larger local batch size without increasing memory requirements, trading extra computation time for memory. The AllReduce synchronization happens only once per N micro-batches, where N is the number of accumulation steps (a short sketch follows below).

Data parallelism is often the starting point for distributed training due to its relative simplicity and effectiveness, especially when the computation per data point is high. However, when models grow too large for a single device's memory, you must combine data parallelism with model parallelism techniques, which we will discuss next.
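As a closing illustration, here is a minimal sketch of the gradient accumulation pattern referenced above, reusing the names from the DDP example. The function name and the accumulation_steps value are illustrative; DDP's no_sync() context manager is used to skip the AllReduce on all but the last micro-batch of each accumulation window.

import contextlib

def train_step_with_accumulation(rank, ddp_model, data_loader, optimizer, criterion,
                                 accumulation_steps=4):
    ddp_model.train()
    optimizer.zero_grad()
    for i, (data, target) in enumerate(data_loader):
        data, target = data.to(rank), target.to(rank)
        # Synchronize (AllReduce) only on the last micro-batch of each accumulation window
        is_sync_step = (i + 1) % accumulation_steps == 0
        # no_sync() suppresses gradient synchronization for this backward pass
        context = contextlib.nullcontext() if is_sync_step else ddp_model.no_sync()
        with context:
            output = ddp_model(data)
            # Scale the loss so the accumulated gradient averages over the micro-batches
            loss = criterion(output, target) / accumulation_steps
            loss.backward()  # gradients accumulate locally; AllReduce fires only on sync steps
        if is_sync_step:
            optimizer.step()       # update using the synchronized, accumulated gradients
            optimizer.zero_grad()  # reset gradients for the next accumulation window

With this pattern, the effective global batch size becomes local_batch_size x accumulation_steps x world_size.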