DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters, Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He, 2020 (Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD '20) - Introduces the DeepSpeed library and its ZeRO memory optimization techniques for training large models.
Fully Sharded Data Parallel (FSDP), PyTorch Documentation, 2022 - Official documentation for PyTorch's native FSDP implementation, explaining its use for large-scale distributed training.
NVIDIA H100 GPU Architecture In-Depth, NVIDIA, 2022 (NVIDIA Technical Whitepaper) - Technical whitepaper providing architectural details of NVIDIA's Hopper H100 GPU, including Tensor Cores and NVLink interconnects.
Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - A comprehensive textbook providing foundational knowledge of deep learning principles and algorithms.