Training and fine-tuning models with billions of parameters requires specialized operational practices distinct from standard machine learning workflows. The scale introduces significant challenges in computation, memory management, and coordination across potentially hundreds of accelerators.
This chapter focuses on the operational aspects of managing these large-scale training processes. The sections below show you how to orchestrate distributed jobs, apply data and model parallelism, work with frameworks such as DeepSpeed and Megatron-LM, fine-tune efficiently with PEFT, track large-scale experiments, and build in checkpointing and fault tolerance.
By the end of this chapter, you will understand the core techniques and operational considerations for successfully training and fine-tuning large language models within an MLOps framework.
3.1 Orchestrating Distributed Training Jobs
3.2 Implementing Data Parallelism Strategies
3.3 Implementing Model Parallelism Strategies
3.4 Utilizing Frameworks like DeepSpeed and Megatron-LM
3.5 Operationalizing Parameter-Efficient Fine-tuning (PEFT)
3.6 Experiment Tracking for Large-Scale Runs
3.7 Checkpointing and Fault Tolerance Mechanisms
3.8 Hands-on Practical: Distributed Training Setup
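As a preview of the setups these sections cover, the sketch below shows a minimal data-parallel training loop. It assumes PyTorch's `torch.distributed` and `DistributedDataParallel`; the model, dataset, and hyperparameters are placeholders for illustration, not the chapter's hands-on example.

```python
# Minimal sketch of a data-parallel training loop, assuming PyTorch DDP.
# The model, dataset, and hyperparameters are illustrative placeholders.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # A launcher such as torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE
    # for each process before this script starts.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model wrapped in DDP so gradients are synchronized.
    model = torch.nn.Linear(512, 512).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Synthetic data; DistributedSampler shards it across ranks.
    dataset = TensorDataset(torch.randn(4096, 512), torch.randn(4096, 512))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for inputs, targets in loader:
            inputs = inputs.cuda(local_rank, non_blocking=True)
            targets = targets.cuda(local_rank, non_blocking=True)
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()  # DDP averages gradients across ranks here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

A command such as `torchrun --nproc_per_node=8 train_ddp.py` would launch one process per GPU (the filename is illustrative). Later sections build on this pattern with model parallelism, DeepSpeed and Megatron-LM, experiment tracking, and checkpointing.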