Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017 (Advances in Neural Information Processing Systems 30, NIPS 2017), DOI: 10.48550/arXiv.1706.03762 - This seminal paper introduced the Transformer architecture and the Multi-Head Attention mechanism, making it the primary source for the mechanism's origin, formulation, and design rationale.
Transformers and Self-Attention (Lecture Notes / Slide Deck), Tatsunori Hashimoto, 2023 (Stanford University) - Comprehensive lecture slides from a leading university course, offering a clear and accessible explanation of Multi-Head Attention within the context of Transformers.
Speech and Language Processing (3rd ed. draft), Daniel Jurafsky and James H. Martin, 2023 - A widely respected textbook in natural language processing, offering a detailed and pedagogical explanation of Multi-Head Attention and its role in Transformer networks.