ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He, 2020, SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (IEEE Press), DOI: 10.1109/SC41405.2020.00024 - Describes the ZeRO (Zero Redundancy Optimizer) family of memory optimizations, which partition optimizer states, gradients, and parameters across data-parallel ranks; foundational for data parallelism in DeepSpeed and crucial for training very large models.
DeepSpeed Documentation, DeepSpeed Team, 2025 (Microsoft) - The official online documentation, providing comprehensive guides, API references, and tutorials for using DeepSpeed, including its advanced parallelism features and ZeRO optimizations; a minimal usage sketch follows this list.
Megatron-LM Documentation, NVIDIA, 2024 - The primary source for official documentation, examples, and ongoing development for NVIDIA's Megatron-LM framework, detailing its tensor and pipeline parallelism implementations.
Megatron-DeepSpeed: A Deep Learning Training System for Extreme Scale Model Training, Olatunji Ruwase, Samyam Rajbhandari, Shaden Smith, Jeff Rasley, Reza Yazdani Aminabadi, Yuxiong He, 2021 (Microsoft Research Blog) - Details the synergistic integration of Megatron-LM for model parallelism (TP/PP) with DeepSpeed for data parallelism (ZeRO) to achieve extreme-scale LLM training.
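Taken together, these sources center on ZeRO-style data parallelism as exposed through DeepSpeed's configuration and Python API. Below is a minimal sketch, not taken from any of the cited sources, of wrapping a toy PyTorch model with deepspeed.initialize and a ZeRO stage 2 configuration; the model shape, batch size, learning rate, and step count are illustrative placeholders, and a real run assumes CUDA devices and a multi-rank launch via the deepspeed launcher (e.g. `deepspeed train.py`).

```python
import torch
import deepspeed

# Toy model; any torch.nn.Module can be wrapped the same way.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

# Illustrative config (values are placeholders): ZeRO stage 2 partitions
# optimizer states and gradients across data-parallel ranks.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2, "overlap_comm": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

# The returned engine owns the partitioned optimizer and handles
# gradient reduction across ranks transparently.
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for _ in range(10):
    x = torch.randn(4, 1024, device=engine.device, dtype=torch.half)
    loss = engine(x).float().pow(2).mean()
    engine.backward(loss)  # scales the loss and reduces/partitions gradients
    engine.step()          # optimizer step on this rank's partition only
```

Setting the stage to 3 in zero_optimization extends the same configuration to partition the parameters as well, trading additional communication for further memory savings, which is the regime the ZeRO paper targets for the largest models.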