Saving and Loading Models, Matthew Inkawhich, 2018 (PyTorch Foundation) - This official PyTorch tutorial details how to save and load model parameters, optimizer states, and other training components, directly supporting the implementation of checkpoint resumption.
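A minimal sketch of the pattern this tutorial describes, bundling the model's `state_dict`, the optimizer's `state_dict`, and bookkeeping into one checkpoint dict; the layer sizes, filename, and `epoch` field here are illustrative assumptions, not from the tutorial:

```python
import torch
import torch.nn as nn

# Illustrative model and optimizer (sizes are arbitrary assumptions).
model = nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Save everything needed to resume training in a single dict.
checkpoint = {
    "epoch": 5,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}
torch.save(checkpoint, "checkpoint.pt")

# Resume: rebuild the objects first, then restore their states.
model2 = nn.Linear(4, 2)
optimizer2 = torch.optim.AdamW(model2.parameters(), lr=1e-3)
resumed = torch.load("checkpoint.pt")
model2.load_state_dict(resumed["model_state_dict"])
optimizer2.load_state_dict(resumed["optimizer_state_dict"])
start_epoch = resumed["epoch"] + 1
```

Note that the optimizer's `state_dict` is saved alongside the weights: for stateful optimizers such as AdamW, dropping it and restarting from weights alone silently resets the moment estimates.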
DeepSpeed Checkpointing, Microsoft DeepSpeed Team, 2024 (DeepSpeed.ai) - The official DeepSpeed guide on checkpointing, essential for understanding how to save and resume training efficiently with sharded states in large-scale distributed environments.
Checkpointing in Accelerate, Hugging Face Team, 2024 (Hugging Face) - Hugging Face Accelerate offers a high-level API to simplify distributed training and checkpointing, providing a practical framework-level solution to many of the challenges discussed.
Decoupled Weight Decay Regularization, Ilya Loshchilov and Frank Hutter, 2019 (International Conference on Learning Representations, ICLR; DOI: 10.48550/arXiv.1711.05101) - This foundational paper introduces AdamW, an optimizer explicitly mentioned in the section, explaining its mechanism and the importance of its internal state for effective optimization.
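A sketch of the decoupled update from the paper, in simplified standard notation (learning-rate schedule factors omitted):

```latex
\theta_t \leftarrow \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_{t-1} \right)
```

Here $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected first and second moment estimates and $\lambda$ is the decoupled weight-decay coefficient; the moments are precisely the per-parameter optimizer state that must be saved and restored for checkpoint resumption to be exact.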