Saving and Loading Models, PyTorch Contributors, 2017 - Documents the fundamental APIs for saving and loading model states, optimizer states, and other components in PyTorch; foundational for understanding how checkpointing is implemented in practice.
DeepSpeed: Advanced Checkpointing, DeepSpeed Team, 2024 - Discusses optimized checkpointing strategies in the DeepSpeed framework, which is widely used for large language model training; covers efficient mechanisms with both synchronous and asynchronous characteristics.
Hydra: Understanding and Improving Distributed Checkpointing in Deep Learning, Karki, Hritik and Narayanasamy, Sanjeev and Shah, Nirmit and Chen, Bo and Wang, Yuandong and D'Sa, Renju and Agarwal, Sachin and Chen, Chien-Chung and Chintapalli, Srinivas, 2022, SC'22: International Conference for High Performance Computing, Networking, Storage and Analysis (ACM), DOI: 10.1145/3550209.3552097 - Analyzes and proposes improvements to distributed checkpointing methods, including strategies that manage the trade-off between consistency and performance, directly addressing the synchronous vs. asynchronous dilemma.
Tesseract: A Two-Level Checkpointing Protocol for Large-Scale Deep Learning, Kang, Yu and Zhang, Peifeng and Yang, Hong and Wang, Jiaqi and Zhu, Yanyuan and Wu, You and Zhang, Wei and Liu, Yong and Tian, Jin, 2023, Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (ACM), DOI: 10.1145/3575693.3575702 - Introduces a two-level checkpointing protocol designed for efficiency in large-scale deep learning, combining the strengths of synchronous and asynchronous approaches to address the challenges of each.