Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017, Advances in Neural Information Processing Systems 30 (NIPS 2017), DOI: 10.48550/arXiv.1706.03762 - The original paper that introduced the Transformer architecture and the Multi-Head Attention mechanism.
MultiheadAttention, PyTorch team, 2024 - Official documentation for PyTorch's `torch.nn.MultiheadAttention` module, useful for understanding its parameters and usage in practice.
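As a quick orientation to the parameters the documentation above covers, a minimal self-attention sketch with `torch.nn.MultiheadAttention` might look like the following (the `embed_dim`, `num_heads`, and tensor sizes are arbitrary illustration values, not from any of the references):

```python
# Minimal sketch of torch.nn.MultiheadAttention usage (self-attention).
# Assumes PyTorch is installed; sizes below are illustrative only.
import torch
import torch.nn as nn

embed_dim, num_heads = 64, 8  # embed_dim must be divisible by num_heads
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(2, 10, embed_dim)  # (batch, sequence length, embedding dim)
out, weights = mha(x, x, x)        # self-attention: query = key = value

print(out.shape)      # torch.Size([2, 10, 64])
print(weights.shape)  # torch.Size([2, 10, 10]) -- averaged over the heads
```

Passing the same tensor as query, key, and value gives self-attention; by default the returned attention weights are averaged across the heads, which is why the second shape has no head dimension.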
Stanford CS224N: Natural Language Processing with Deep Learning, Course Notes, Chris Manning, John Hewitt, and Stanford CS224N Course Staff, 2023 (Stanford University) - Comprehensive course notes from a leading NLP course, with Chapter 7 dedicated to the Transformer architecture and its components like Multi-Head Attention.