Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016, MIT Press - Provides theoretical background and implementation details for various optimization algorithms, including gradient descent, SGD, and momentum.
Adam: A Method for Stochastic Optimization, Diederik P. Kingma and Jimmy Ba, 2014, 3rd International Conference on Learning Representations (ICLR), DOI: 10.48550/arXiv.1412.6980 - Introduces the Adam optimizer and details its adaptive learning rate mechanism based on the first and second moments of the gradients (a short sketch of the update rule follows this list).
Optimisers, Flux.jl Documentation, 2024 - Official guide for using optimizers within the Flux.jl framework, describing available algorithms and their API.
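For a concrete picture of the mechanism described in the Adam reference above, here is a minimal sketch of one Adam update step in plain Julia, following the update rule from Kingma and Ba's paper. The type `AdamState` and the function `adam_step!` are illustrative names for this sketch only; they are not part of Flux.jl or any other library.

```julia
# Minimal, illustrative Adam step in plain Julia (sketch of the rule from
# Kingma & Ba, 2014). Hyperparameter names (α, β1, β2, ϵ) follow the paper.

mutable struct AdamState
    m::Vector{Float64}   # first-moment (mean) estimate of the gradient
    v::Vector{Float64}   # second-moment (uncentred variance) estimate
    t::Int               # timestep, used for bias correction
end

AdamState(n::Integer) = AdamState(zeros(n), zeros(n), 0)

function adam_step!(θ::Vector{Float64}, g::Vector{Float64}, s::AdamState;
                    α = 0.001, β1 = 0.9, β2 = 0.999, ϵ = 1e-8)
    s.t += 1
    @. s.m = β1 * s.m + (1 - β1) * g      # update biased first-moment estimate
    @. s.v = β2 * s.v + (1 - β2) * g^2    # update biased second-moment estimate
    m̂ = s.m ./ (1 - β1^s.t)               # bias-corrected first moment
    v̂ = s.v ./ (1 - β2^s.t)               # bias-corrected second moment
    @. θ -= α * m̂ / (sqrt(v̂) + ϵ)         # per-parameter adaptive update
    return θ
end

# Example: a few steps on f(θ) = sum(θ.^2), whose gradient is 2θ.
θ = [1.0, -2.0]
state = AdamState(length(θ))
for _ in 1:3
    adam_step!(θ, 2 .* θ, state)
end
```

The Flux.jl documentation listed above covers the library's own optimizer implementations and API, which should be preferred in practice over a hand-rolled update loop like this one.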