Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition, William Chan, Navdeep Jaitly, Quoc V. Le, Oriol Vinyals, 2016. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE). DOI: 10.1109/ICASSP.2016.7472659 - This seminal paper introduces the Listen, Attend and Spell (LAS) model, a foundational end-to-end ASR architecture, and demonstrates the effectiveness of incorporating an external language model through shallow fusion for improved recognition performance.
Streaming End-to-End ASR with Shallow-Fused Contextual Biasing, Wenkuan Fang, Jiahong Yuan, Yu Zhang, and Shinji Watanabe, 2019. Interspeech (International Speech Communication Association, ISCA). DOI: 10.21437/Interspeech.2019-2646 - This paper details a practical shallow fusion approach for integrating external language models, combined with contextual biasing, into streaming end-to-end ASR systems, showcasing its impact on recognition performance.
Speech and Language Processing (3rd ed. draft), Daniel Jurafsky and James H. Martin, 2025 - This authoritative textbook provides a comprehensive theoretical background for speech recognition, covering fundamental concepts of acoustic and language modeling, and the mathematical framework for combining them during decoding.