A Neural Probabilistic Language Model, Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin, 2003, Journal of Machine Learning Research, Vol. 3 - A foundational paper that introduces one of the earliest models for learning word embeddings together with a neural network-based language model.
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, 2017, Advances in Neural Information Processing Systems 30 (Curran Associates, Inc.) - Introduces the Transformer architecture, which revolutionized sequence modeling and became the basis for modern large language models.
CS224N: Natural Language Processing with Deep Learning, Christopher Manning and Abigail See, 2023, Stanford University - An advanced university course providing lecture videos and notes covering word embeddings, recurrent neural networks, and modern transformer models.