When training a neural network, especially with large datasets, you rarely feed the entire dataset to the model in a single pass. Doing so would be computationally expensive and often inefficient for learning. Instead, the training process is structured around two important units of data organization: epochs and batches. Understanding these will help you configure your training loops effectively and manage computational resources.
An epoch represents one complete pass through your entire training dataset. If you have 1,000 training samples, one epoch is completed when the model has seen and processed all 1,000 samples.
Neural networks learn iteratively. A single pass over the data (one epoch) is almost never sufficient for the model to learn the underlying patterns effectively. The model's parameters (weights and biases) are adjusted gradually. Therefore, training typically involves running for multiple epochs. Think of it like reading a textbook: you often need to go through the material several times to grasp the concepts fully.
The number of epochs is a hyperparameter you'll need to set. Too few epochs and the model may underfit, not having seen the data enough times to learn its patterns; too many and it may begin to overfit, effectively memorizing the training set. As the model trains over multiple epochs, you'll typically monitor its performance on a separate validation dataset to decide when to stop training.
Processing an entire dataset at once, especially for datasets with millions of samples, can be demanding on memory (RAM and GPU VRAM) and can also lead to slower convergence. This is where mini-batches come in.
A mini-batch (often simply called a "batch") is a smaller, manageable subset of your training dataset. Instead of updating the model's weights after processing the entire dataset (which would be Batch Gradient Descent), you update them after processing each mini-batch.
For example, if your training dataset has 1,000 samples and you choose a batch size of 100, the dataset will be divided into 1,000 / 100 = 10 batches. The model will process the first 100 samples, calculate the loss, compute gradients, and update its weights. Then it will process the next 100 samples, update weights again, and so on, until all 10 batches (and thus, all 1,000 samples) have been processed. This completion of all batches constitutes one epoch.
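To make this structure concrete, here is a minimal sketch of the nested epoch/batch loops in Julia. The array X is dummy data standing in for a real dataset, and the forward pass, loss, gradients, and weight update are left as a placeholder comment.

```julia
# Minimal sketch of how one epoch decomposes into mini-batches.
X = randn(Float32, 10, 1_000)   # 1,000 samples, 10 features each (dummy data)
batchsize = 100
n_epochs  = 3

for epoch in 1:n_epochs
    updates = 0
    for start in 1:batchsize:size(X, 2)
        batch = X[:, start:min(start + batchsize - 1, end)]
        # forward pass, loss, gradients, and weight update would go here
        updates += 1
    end
    println("epoch $epoch finished after $updates weight updates")  # 10 per epoch
end
```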
Using mini-batches offers several advantages. Each batch fits comfortably in memory, so you can train on datasets far larger than your available RAM or GPU VRAM. The weights are updated many times per epoch rather than once, which typically accelerates early learning. And because each batch is only a sample of the full dataset, the gradient estimates are slightly noisy; in practice this noise can help the optimizer escape poor local minima and often acts as a mild regularizer.
An iteration refers to a single update of the model's parameters. In the context of mini-batch gradient descent (the most common training strategy), one iteration corresponds to processing one mini-batch of data.
So, the relationship is: iterations per epoch = number of training samples / batch size (rounded up when the division isn't exact), and total iterations = iterations per epoch × number of epochs.
For our example of 1,000 samples and a batch size of 100, each epoch consists of 1,000 / 100 = 10 iterations. Training for, say, 20 epochs would therefore perform 10 × 20 = 200 parameter updates in total.
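If you want to check these numbers in code, the arithmetic takes only a couple of lines (the epoch count of 20 is just the illustrative value used above):

```julia
n_samples = 1_000
batchsize = 100
n_epochs  = 20

iterations_per_epoch = cld(n_samples, batchsize)        # ceiling division: 10
total_iterations     = n_epochs * iterations_per_epoch  # 200
```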
The diagram below illustrates how a dataset is processed in epochs and batches, leading to iterative model updates.
Relationship between the full dataset, epochs, batches, and iterations. An epoch involves processing all batches derived from the dataset, with each batch processing step constituting an iteration where the model updates its parameters.
The batch size is another important hyperparameter that can significantly affect training dynamics and model performance. There's no one-size-fits-all answer, and the optimal batch size often depends on the dataset, model architecture, and available hardware.
Small batch sizes (e.g., 1, 8, 16, 32): each gradient is estimated from only a handful of samples, so the updates are noisy. This noise can help the optimizer escape shallow local minima and often acts as a form of regularization, but it also makes convergence less smooth. Small batches need less memory per update and produce more frequent weight updates, though they use modern hardware less efficiently, so an epoch usually takes longer in wall-clock time.

Large batch sizes (e.g., 128, 256, 512+): gradient estimates are more accurate and GPU parallelism is exploited well, so each epoch runs faster. The trade-offs are higher memory requirements, fewer weight updates per epoch, and, as often observed in practice, a tendency to converge to solutions that generalize somewhat worse unless other hyperparameters such as the learning rate are adjusted accordingly.
Commonly used batch sizes in deep learning range from 32 to 256, but this is highly empirical. It's often a good idea to experiment with different batch sizes. The batch size can also interact with other hyperparameters, such as the learning rate. For instance, when increasing the batch size, you might sometimes need to increase the learning rate as well to maintain similar training dynamics.
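One commonly cited heuristic for this interaction is the linear scaling rule: scale the learning rate in proportion to the batch size, relative to a baseline configuration you already know trains well. Treat the sketch below as a rough starting point rather than a guarantee; the baseline values are arbitrary placeholders, not recommendations.

```julia
# Linear scaling heuristic: learning rate grows proportionally with batch size.
base_batchsize = 32      # a baseline configuration assumed to train well
base_lr        = 0.001

scaled_lr(batchsize) = base_lr * batchsize / base_batchsize

scaled_lr(128)   # 0.004: 4x the batch size, so 4x the learning rate
```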
In Julia, libraries like MLUtils.jl provide tools like DataLoader to efficiently create and manage these batches from your dataset, which you'll then iterate over in your training loop. We touched upon MLUtils.jl in Chapter 3 when discussing data handling, and you'll see it in action as we construct full training loops.
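As a preview, here is a minimal sketch of that pattern, assuming MLUtils.jl is installed and using small random arrays in place of a real dataset:

```julia
using MLUtils

# Dummy data: 1,000 samples with 10 features each, plus one target per sample.
X = randn(Float32, 10, 1_000)
y = randn(Float32, 1_000)

# DataLoader yields mini-batches and, with shuffle = true, reshuffles the
# samples each time you start iterating over it (i.e., once per epoch).
loader = DataLoader((X, y); batchsize = 100, shuffle = true)

for epoch in 1:5
    for (xb, yb) in loader        # 10 mini-batches per epoch
        # xb is a 10x100 matrix, yb a length-100 vector; training step goes here
    end
end
```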
By structuring your training process into epochs and batches, you gain fine-grained control over how your model learns from the data, balancing computational efficiency with learning effectiveness. Next, we'll see how these concepts fit into the overall model training loop.