Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017 (Advances in Neural Information Processing Systems 30, NIPS 2017), DOI: 10.48550/arXiv.1706.03762 - The foundational paper introducing the Transformer architecture and the Scaled Dot-Product Attention mechanism (a minimal sketch follows this list).
The Annotated Transformer, Alexander Rush, 2018 (Harvard NLP) - A widely referenced PyTorch implementation and detailed explanation of the Transformer model, including Scaled Dot-Product Attention.
Dive into Deep Learning, Aston Zhang, Zachary C. Lipton, Mu Li, Alexander J. Smola, 2021 (Cambridge University Press) - An interactive deep learning textbook providing comprehensive coverage of attention mechanisms and Transformer architectures, with executable code.
MultiheadAttention, PyTorch Authors, 2024 (PyTorch Foundation) - Official PyTorch documentation for the MultiheadAttention module, which uses Scaled Dot-Product Attention internally, covering its parameters and practical usage (a usage sketch follows below).
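For quick reference alongside the Vaswani et al. entry, below is a minimal PyTorch sketch of Scaled Dot-Product Attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V. The function name and tensor shapes are illustrative assumptions, not taken from the paper.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # Assumed shapes: q, k are (batch, seq_len, d_k); v is (batch, seq_len, d_v).
    d_k = q.size(-1)
    # Dot-product similarity between queries and keys, scaled by sqrt(d_k)
    # to keep the softmax out of its low-gradient saturation region.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 receive zero weight after the softmax.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1 over the keys
    return weights @ v, weights

# Self-attention example: queries, keys, and values come from the same tensor.
x = torch.randn(2, 5, 64)
out, weights = scaled_dot_product_attention(x, x, x)
print(out.shape, weights.shape)  # torch.Size([2, 5, 64]) torch.Size([2, 5, 5])
```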
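Similarly, a minimal usage sketch of the torch.nn.MultiheadAttention module covered by the PyTorch documentation entry; embed_dim=512 and num_heads=8 mirror the paper's base model but are otherwise arbitrary choices here.

```python
import torch
import torch.nn as nn

# batch_first=True makes inputs (batch, seq_len, embed_dim).
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 512)  # (batch=2, seq_len=10, embed_dim=512)

# Self-attention: the same tensor serves as query, key, and value.
attn_output, attn_weights = mha(x, x, x)
print(attn_output.shape)   # torch.Size([2, 10, 512])
print(attn_weights.shape)  # torch.Size([2, 10, 10])
```

Note that by default the returned attention weights are averaged over the heads; in recent PyTorch versions, passing average_attn_weights=False to the forward call returns per-head weights instead.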