Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, 2017 (Advances in Neural Information Processing Systems 30, NIPS 2017), DOI: 10.48550/arXiv.1706.03762 - The foundational paper introducing the Transformer model, detailing its architecture, training objectives including cross-entropy loss, and the use of label smoothing.
Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - An authoritative textbook providing a comprehensive explanation of cross-entropy loss, its mathematical underpinnings, and its application in various deep learning models.
torch.nn.CrossEntropyLoss, PyTorch Contributors, 2023 (PyTorch Foundation) - Official documentation for PyTorch's implementation of cross-entropy loss, detailing parameters like ignore_index for handling padding in sequence tasks.
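The two mechanisms these references highlight, label smoothing and skipping padding positions via an ignore index, can be illustrated with a minimal pure-Python sketch. The function name, its defaults, and the choice to spread the smoothing mass over the non-target classes are illustrative assumptions, not PyTorch's actual implementation (which should be used in practice via `torch.nn.CrossEntropyLoss`):

```python
import math

def smoothed_cross_entropy(logits, target, smoothing=0.0, ignore_index=-100):
    """Cross-entropy with label smoothing for a single token (illustrative sketch).

    `logits` is a list of raw class scores, `target` the gold class index.
    `ignore_index` marks padding positions whose loss is skipped, mirroring
    the role of the ignore_index parameter in torch.nn.CrossEntropyLoss.
    """
    if target == ignore_index:
        return 0.0  # padding token contributes no loss
    # Log-softmax with max-subtraction for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    # Smoothed target distribution: (1 - eps) on the gold class, eps spread
    # uniformly over the remaining classes (one common variant of smoothing).
    k = len(logits)
    loss = 0.0
    for i, lp in enumerate(log_probs):
        q = (1.0 - smoothing) if i == target else smoothing / (k - 1)
        loss -= q * lp
    return loss
```

With `smoothing=0.0` this reduces to ordinary cross-entropy; with a small positive value (the Transformer paper uses 0.1) the target distribution is softened, which hurts perplexity slightly but improves accuracy and BLEU.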