Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017Advances in Neural Information Processing Systems, Vol. 30DOI: 10.48550/arXiv.1706.03762 - Introduces the Transformer architecture and the self-attention mechanism, including the scaled dot-product attention formula.
Natural Language Processing with Transformers, Lewis Tunstall, Leandro von Werra, Thomas Wolf, 2022 (O'Reilly Media) - Provides practical guidance on Transformer models, with discussions and examples of attention visualization.