Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems (NIPS 2017). DOI: 10.48550/arXiv.1706.03762 - The original paper introducing the Transformer architecture and Multi-Head Attention, detailing its mechanism and advantages.
The Annotated Transformer, Alexander Rush, Vincent Nguyen, Guillaume Klein, 2018 - A comprehensive, line-by-line explanation and PyTorch implementation of the Transformer model, including a clear breakdown of Multi-Head Attention.
Dive into Deep Learning, Aston Zhang, Zachary C. Lipton, Mu Li, Alexander J. Smola, 2024 (Cambridge University Press) - An open-source, interactive textbook providing detailed explanations and runnable code examples for deep learning concepts, with a dedicated section on Multi-Head Attention.
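To connect these references to code, below is a minimal sketch of Multi-Head Attention in PyTorch, following the scaled dot-product formulation of Vaswani et al. (2017). The class and variable names are illustrative choices, not taken verbatim from any of the cited implementations; the defaults d_model=512 and num_heads=8 match the paper's base model.

```python
# Minimal Multi-Head Attention sketch (illustrative, not a verbatim
# reproduction of any cited implementation).
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # per-head dimension
        # Learned projections for queries, keys, values, and the output.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value):
        batch, seq_len, d_model = query.shape

        # Project, then split the model dimension into heads:
        # (batch, seq, d_model) -> (batch, heads, seq, d_k)
        def split(x):
            return x.view(batch, -1, self.num_heads, self.d_k).transpose(1, 2)

        q, k, v = split(self.w_q(query)), split(self.w_k(key)), split(self.w_v(value))

        # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V,
        # computed independently for every head.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        attn = scores.softmax(dim=-1)
        out = attn @ v

        # Re-merge the heads and apply the output projection.
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.w_o(out)

# Usage: self-attention over a batch of 2 sequences of length 10.
# x = torch.randn(2, 10, 512); MultiHeadAttention()(x, x, x).shape == (2, 10, 512)
```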