Speech and Language Processing (3rd Edition), Daniel Jurafsky and James H. Martin, 2025 - This is an ongoing draft of a widely-used textbook. It covers various aspects of speech and language processing, including comprehensive sections on Text-to-Speech synthesis fundamentals, text normalization, linguistic analysis, and the evolution of TTS systems.
The HMM-based Speech Synthesis System (HTS), Keiichi Tokuda, Takashi Zen, Yoshihiko Nankaku, 2002, 7th International Conference on Spoken Language Processing, ICSLP2002 - INTERSPEECH 2002 (ISCA) - This paper introduces the foundational HMM-based framework for speech synthesis, which was a dominant parametric TTS approach for many years. It details the statistical modeling of spectral features and F0.
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu, 2018, ICASSP, DOI: 10.48550/arXiv.1712.05884 - This paper introduces Tacotron 2, a significant end-to-end neural text-to-speech system that generates natural-sounding speech by predicting mel-spectrograms which are then converted to audio by a WaveNet vocoder. It demonstrates a major step towards simpler and higher-quality neural TTS.