Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, 2017 (Advances in Neural Information Processing Systems, Vol. 30), DOI: 10.48550/arXiv.1706.03762 - The foundational paper introducing the Transformer architecture, multi-head attention, and positional encoding, which are core to the model's implementation.
Speech and Language Processing (3rd ed. draft), Daniel Jurafsky and James H. Martin, 2025 (Stanford University) - A comprehensive textbook on natural language processing, including extensive coverage of the Transformer architecture, its components, and training procedures.
Stanford CS224N: Natural Language Processing with Deep Learning, Diyi Yang and Tatsunori Hashimoto, 2025 (Stanford University) - A university course covering the theoretical background and practical aspects of building and training neural NLP models, including detailed lectures on Transformers.