Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017, Advances in Neural Information Processing Systems, Vol. 30 (NeurIPS), DOI: 10.5555/3295222.3295349 - Introduces the Transformer architecture and the multi-head attention mechanism, which is the basis of modern LLMs.
MultiheadAttention, PyTorch Authors, 2024 - Official documentation for PyTorch's built-in MultiheadAttention module, useful for understanding standard implementation patterns and usage.
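As a quick orientation to the module documented above, here is a minimal self-attention sketch with `torch.nn.MultiheadAttention`; the embedding size, head count, and tensor shapes are illustrative assumptions, not values from the documentation:

```python
import torch
import torch.nn as nn

# Illustrative hyperparameters (assumptions for this sketch).
embed_dim, num_heads = 64, 8
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# Self-attention: the same tensor serves as query, key, and value.
x = torch.randn(2, 10, embed_dim)  # (batch, seq_len, embed_dim)
out, attn_weights = mha(x, x, x)

print(out.shape)           # torch.Size([2, 10, 64])
print(attn_weights.shape)  # torch.Size([2, 10, 10]), averaged over heads
```

With `batch_first=True` the module accepts `(batch, seq, embed)` tensors; by default it returns attention weights averaged across heads.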
Speech and Language Processing (3rd ed. draft), Daniel Jurafsky and James H. Martin, 2025 - A standard textbook that offers a comprehensive academic treatment of Transformers and attention mechanisms within the context of natural language processing.
The Annotated Transformer, Alexander Rush, Vincent Nguyen, Guillaume Klein, 2018 - A widely cited, line-by-line explanation and PyTorch implementation of the Transformer model, serving as an excellent companion to the original paper.