Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017, Advances in Neural Information Processing Systems 30 (NIPS 2017), DOI: 10.48550/arXiv.1706.03762 - The original research paper that introduced the Transformer architecture and the Scaled Dot-Product Attention mechanism.
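For quick reference, the mechanism this paper defines (its Equation 1, restated here verbatim) is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the key dimension used for scaling.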
Natural Language Processing with Transformers, Lewis Tunstall, Leandro von Werra, Thomas Wolf, 2022 (O'Reilly Media) - A practical and comprehensive guide to Transformer models, including the underlying attention mechanisms.
MultiheadAttention - PyTorch documentation, PyTorch Core Team, 2024 (PyTorch Foundation) - Official documentation for PyTorch's torch.nn.MultiheadAttention module, relevant for understanding the practical implementation and parameters of attention layers.
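As a companion to that documentation entry, here is a minimal self-attention sketch using torch.nn.MultiheadAttention; the batch size, sequence length, and embedding dimension are illustrative assumptions, not values taken from the docs.

```python
import torch
import torch.nn as nn

# Illustrative hyperparameters (d_model=512, 8 heads matches the original paper's base model).
embed_dim, num_heads = 512, 8
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# Dummy input: (batch, sequence length, embedding dim) -- assumed shapes for this sketch.
x = torch.randn(2, 10, embed_dim)

# Self-attention: query, key, and value are all the same tensor.
attn_output, attn_weights = mha(x, x, x)
print(attn_output.shape)   # torch.Size([2, 10, 512])
print(attn_weights.shape)  # torch.Size([2, 10, 10]); weights averaged over heads by default
```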