torch.distributed - PyTorch 2.3 documentation, PyTorch Development Team, 2017 - Official documentation for PyTorch's distributed package, providing details on collective communication operations like dist.barrier() and principles of distributed data parallelism.
DeepSpeed: Saving/Loading Checkpoints, Microsoft DeepSpeed Team, 2025 - Official DeepSpeed tutorial demonstrating how to effectively save and load sharded checkpoints, particularly relevant for large-scale model training with memory optimization techniques like ZeRO.
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He, 2020. SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. DOI: 10.5555/3433701.3433785 - Introduces the Zero Redundancy Optimizer (ZeRO), which partitions optimizer states, gradients, and parameters across data-parallel devices, fundamentally enabling the scalable sharded checkpointing discussed.
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, 2020. arXiv preprint arXiv:1909.08053. DOI: 10.48550/arXiv.1909.08053 - Describes the Megatron-LM framework, detailing tensor and pipeline parallelism strategies that shard model state across devices and therefore necessitate distributed checkpointing.