A Neural Probabilistic Language Model, Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin, 2003. Journal of Machine Learning Research, Vol. 3. DOI: 10.1162/jmlr.2003.3.6.1137 - This paper introduced the concept of neural language models, using distributed word representations and neural networks for probability estimation.
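A minimal NumPy sketch of the architecture this entry describes: the previous n context words are looked up in a shared embedding table, concatenated, passed through a tanh hidden layer, and normalized with a softmax over the vocabulary. All sizes below are illustrative assumptions, and the paper's optional direct input-to-output connections are omitted.

```python
import numpy as np

# Illustrative sizes (assumptions, not the paper's settings):
V, n, d, h = 10_000, 3, 64, 128   # vocab size, context words, embedding dim, hidden dim

rng = np.random.default_rng(0)
C = rng.normal(0, 0.1, (V, d))        # shared word-embedding table
H = rng.normal(0, 0.1, (n * d, h))    # hidden-layer weights
U = rng.normal(0, 0.1, (h, V))        # output weights

def next_word_probs(context_ids):
    """Estimate P(w_t | previous n context words) for one window."""
    x = C[context_ids].reshape(-1)     # look up and concatenate the n embeddings
    a = np.tanh(x @ H)                 # hidden representation
    logits = a @ U
    e = np.exp(logits - logits.max())  # numerically stable softmax over the vocabulary
    return e / e.sum()

probs = next_word_probs([12, 7, 421])  # three made-up word ids
```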
Long Short-Term Memory, Sepp Hochreiter and Jürgen Schmidhuber, 1997. Neural Computation, Vol. 9 (MIT Press). DOI: 10.1162/neco.1997.9.8.1735 - This foundational paper presented the LSTM architecture, which improved RNNs by addressing the vanishing gradient problem and enabling learning of longer sequences.
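To make the mechanism concrete, here is one step of an LSTM cell in NumPy. This is the now-standard variant with a forget gate (a later refinement; the original 1997 cell had none), and all dimensions and initializations are illustrative assumptions.

```python
import numpy as np

d_in, d_h = 32, 64                    # input and hidden sizes (assumptions)
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (4 * d_h, d_in + d_h))  # gate weights stacked: i, f, o, g
b = np.zeros(4 * d_h)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev):
    z = W @ np.concatenate([x, h_prev]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    c = f * c_prev + i * np.tanh(g)   # additive cell-state update: the path
                                      # that lets gradients survive long spans
    h = o * np.tanh(c)                # gated output becomes the new hidden state
    return h, c

h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h))
```

The additive update of the cell state c is what counters vanishing gradients: error signals flow through it largely unattenuated, unlike the repeated matrix multiplications of a plain RNN.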
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, 2017. Advances in Neural Information Processing Systems, Vol. 30 (Curran Associates, Inc.). DOI: 10.48550/arXiv.1706.03762 - The paper that introduced the Transformer, an architecture based entirely on attention mechanisms, which revolutionized sequence modeling and language models.
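Its core operation is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A single-head NumPy sketch follows; multi-head projections and masking are omitted, and the shapes are made up for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V: each row of Q attends over the rows of K/V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # similarity of each query to each key
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable row-wise softmax
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V                         # weighted mixture of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 16))   # 5 query positions, d_k = 16 (illustrative)
K = rng.normal(size=(7, 16))   # 7 key/value positions
V = rng.normal(size=(7, 16))
out = scaled_dot_product_attention(Q, K, V)    # shape (5, 16)
```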
Speech and Language Processing, Daniel Jurafsky and James H. Martin, 2025 (Pearson; 3rd edition draft) - A comprehensive textbook that covers traditional and neural language models, along with their application in speech recognition, incorporating recent developments such as Transformers.