Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems, Vol. 30. DOI: 10.48550/arXiv.1706.03762 - Introduces the Transformer architecture and its masked multi-head self-attention mechanism, essential for the decoder's auto-regressive generation.
Speech and Language Processing (3rd ed. draft), Daniel Jurafsky and James H. Martin, 2025 - A comprehensive textbook providing detailed explanations of Transformer models, including the function and implementation of masked self-attention.
Stanford CS224N: Natural Language Processing with Deep Learning, Christopher Manning, Abigail See, John Hewitt, Tatsunori Hashimoto, 2023 (Stanford University) - Course materials offering a thorough academic perspective on Transformer architectures and the role of masked attention in the decoder.
The Hugging Face Course: Transformers, Hugging Face, 2023 (Hugging Face) - An accessible and practical online course that explains the Transformer architecture and the specific role of masked attention in a clear, interactive manner.
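The masked (causal) self-attention that these references describe can be sketched in a few lines. The following is a minimal single-head illustration, not the implementation from any of the works above; the function name and NumPy-based setup are assumptions for the example. Position i is prevented from attending to positions j > i by setting those attention logits to negative infinity before the softmax, which is what enables auto-regressive generation in the decoder.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head masked (causal) self-attention over a sequence x of shape (T, d).

    This is an illustrative sketch: projections are plain matrix products and
    the mask zeroes out attention to future positions.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)  # (T, T) attention logits
    # Causal mask: position i may attend only to positions j <= i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax (the diagonal entry is always finite, so max is safe).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Usage: the returned weight matrix is lower-triangular, so each output
# position depends only on the current and earlier inputs.
rng = np.random.default_rng(0)
T, d = 4, 8
x = rng.standard_normal((T, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
out, weights = causal_self_attention(x, w_q, w_k, w_v)
```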