Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016 (MIT Press) - Provides comprehensive coverage of probability, information theory, and deep learning fundamentals, including softmax and cross-entropy.
Speech and Language Processing, Daniel Jurafsky, James H. Martin, 2020 (Stanford University) - A widely used textbook on natural language processing, detailing statistical language modeling, the chain rule of probability, and probabilistic foundations for NLP.
Elements of Information Theory, Thomas M. Cover, Joy A. Thomas, 2006 (John Wiley & Sons, Inc.), DOI: 10.1002/0471742762 - A classic text offering a rigorous mathematical treatment of information theory concepts such as entropy, cross-entropy, and KL divergence.
The Curious Case of Neural Text Degeneration, Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, Yejin Choi, 2020 (International Conference on Learning Representations, ICLR), DOI: 10.48550/arXiv.1904.09751 - Introduces nucleus (Top-p) sampling and analyzes decoding strategies for text generation in neural language models.