Decoupled Weight Decay Regularization, Ilya Loshchilov and Frank Hutter, 2019. International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.1711.05101 - Introduces AdamW, which decouples weight decay from the adaptive gradient update and is now a standard optimizer choice for LLM training.
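A minimal NumPy sketch of the decoupled update described in this paper (hyperparameter defaults and variable names here are illustrative, not taken from the paper): the decay term `weight_decay * w` is applied directly to the weights instead of being folded into the gradient, which is what distinguishes AdamW from Adam with L2 regularization.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update step (illustrative sketch, not the paper's reference code)."""
    m = beta1 * m + (1 - beta1) * g          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: decay acts on w directly, outside the adaptive term.
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```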
Scaling Laws for Neural Language Models, Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei, 2020. arXiv preprint. DOI: 10.48550/arXiv.2001.08361 - Shows that language-model loss scales as a power law with model size, dataset size, and compute, informing hyperparameter and budget choices for LLMs.
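A tiny sketch of the kind of power-law fit the paper proposes, of the form L(N) = (N_c / N)^alpha_N for loss versus parameter count; the constants below are illustrative placeholders, not exact values from the paper.

```python
def loss_vs_params(n_params: float, n_c: float = 8.8e13, alpha_n: float = 0.076) -> float:
    """Kaplan-style power law for loss vs. parameter count:
    L(N) = (N_c / N) ** alpha_N.  Constants are illustrative placeholders."""
    return (n_c / n_params) ** alpha_n

# Example: doubling the parameter count lowers predicted loss by roughly 5% under this fit.
print(loss_vs_params(1e9), loss_vs_params(2e9))
```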