Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017, Advances in Neural Information Processing Systems 30 (NIPS 2017). DOI: 10.48550/arXiv.1706.03762 - The original paper introducing the Transformer model, detailing its architecture, including the final linear projection and softmax used for output generation.
Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - A comprehensive textbook covering fundamental deep learning components such as linear transformations and softmax activation functions used for classification.
Stanford CS224N: Natural Language Processing with Deep Learning, Diyi Yang, Tatsunori Hashimoto, 2023 (Stanford University) - Provides educational materials on deep learning for NLP, often covering the application of linear layers and softmax in the output layers of sequence models.
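The output-layer pattern these references all touch on - a final linear projection from the model's hidden dimension to the vocabulary size, followed by a softmax to produce a probability distribution - can be sketched in a few lines of NumPy. This is a minimal illustration, not code from any of the cited works; the sizes (`d_model`, `vocab_size`) and random weights are hypothetical placeholders.

```python
import numpy as np

# Hypothetical sizes for illustration only.
d_model, vocab_size = 8, 20

rng = np.random.default_rng(0)
W = rng.normal(size=(d_model, vocab_size))  # output projection weights
b = np.zeros(vocab_size)                    # output projection bias

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Final hidden state for one position, e.g. the top of a decoder stack.
hidden = rng.normal(size=d_model)

logits = hidden @ W + b   # linear projection to vocabulary-sized logits
probs = softmax(logits)   # probability distribution over the vocabulary

assert probs.shape == (vocab_size,)
assert np.isclose(probs.sum(), 1.0)
```

In practice the projection is a learned layer (e.g. `nn.Linear` in PyTorch), and the Transformer paper additionally shares its weights with the input embedding matrix; the sketch above shows only the projection-plus-softmax step itself.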