Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, 2017, Advances in Neural Information Processing Systems 30 (Curran Associates, Inc.), DOI: 10.48550/arXiv.1706.03762 - The foundational paper that introduced the Transformer architecture and Multi-Head Attention, explaining its design and the rationale for jointly attending to information from different representation subspaces (see the sketch after this list).
Dive into Deep Learning, Aston Zhang, Zachary C. Lipton, Mu Li, and Alex Smola, 2024 (Cambridge University Press) - An open-source educational resource that provides a clear, in-depth explanation of Multi-Head Attention, highlighting its role in allowing the model to focus on diverse aspects of the input simultaneously.
Natural Language Processing with Transformers: Building Language Applications with Hugging Face, Lewis Tunstall, Leandro von Werra, and Thomas Wolf, 2022 (O'Reilly Media) - A comprehensive, practice-oriented book on Transformers, including a detailed pedagogical discussion of Multi-Head Attention's architecture and the benefits of its multi-perspective view of sequence relationships.
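All three references center on the same mechanism: splitting the model dimension into several heads so that each head attends within its own lower-dimensional representation subspace. As a minimal, self-contained sketch of that idea following the formulation in Vaswani et al. (2017) - not code taken from any of the works above; the function names, toy dimensions, and random weights are illustrative assumptions - multi-head attention in NumPy might look like this:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Scaled dot-product attention computed in `num_heads` subspaces.

    X is (seq_len, d_model); each W_* projection is (d_model, d_model).
    Names and shapes are illustrative, not from the cited texts.
    """
    seq_len, d_model = X.shape
    d_head = d_model // num_heads  # dimension of each head's subspace

    # Project the inputs, then split the model dimension into heads so each
    # head attends within its own lower-dimensional representation subspace.
    def split_heads(M):
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = (split_heads(X @ W) for W in (W_q, W_k, W_v))

    # Per-head attention weights: (num_heads, seq_len, seq_len).
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)

    # Weighted sum of values, then merge heads and apply the output projection.
    heads = weights @ V                                     # (h, seq, d_head)
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ W_o

# Toy usage: 4 tokens, d_model = 8, 2 heads (all dimensions arbitrary).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=2)
print(out.shape)  # (4, 8)
```

Because each head operates on only d_model / num_heads dimensions, the total cost is comparable to a single full-width attention, yet each head can learn to weight the sequence differently - the "diverse aspects of the input" that the references above describe.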