Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis, Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu, 2018. Advances in Neural Information Processing Systems 31 (NeurIPS 2018) (Neural Information Processing Systems Foundation, Inc.) - A foundational paper demonstrating zero-shot voice cloning by leveraging speaker embeddings from a pre-trained speaker verification model to condition a multi-speaker Tacotron-based TTS system.
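The conditioning mechanism described above can be sketched in a few lines: a fixed-dimensional speaker embedding (d-vector) from the verification model is broadcast-concatenated to every encoder timestep before attention. A minimal numpy sketch; the function name and dimensions are illustrative, not from the paper.

```python
import numpy as np

def condition_on_speaker(encoder_out, speaker_emb):
    """Tile a fixed speaker embedding across time and concatenate it
    to each encoder state (SV2TTS-style conditioning).
    encoder_out: (T, d_enc), speaker_emb: (d_spk,) -> (T, d_enc + d_spk)."""
    T = encoder_out.shape[0]
    tiled = np.tile(speaker_emb, (T, 1))                  # (T, d_spk)
    return np.concatenate([encoder_out, tiled], axis=1)   # (T, d_enc + d_spk)

enc = np.random.randn(50, 512)   # 50 timesteps of Tacotron encoder states (illustrative sizes)
spk = np.random.randn(256)       # d-vector from the speaker encoder
out = condition_on_speaker(enc, spk)
print(out.shape)  # (50, 768)
```

Because the embedding is produced by a separately trained verification network, an unseen speaker's voice can be cloned at inference time with no TTS fine-tuning.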
AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss, Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, Mark Hasegawa-Johnson, 2019. International Conference on Machine Learning, Vol. 97 (PMLR) - Presents AutoVC, a significant architecture for zero-shot voice conversion that disentangles speaker identity from linguistic content using a carefully sized autoencoder bottleneck trained with only a reconstruction loss.
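AutoVC's central idea is that a sufficiently narrow content bottleneck forces speaker information out of the content code, which the decoder then recovers from a separately supplied speaker embedding. A toy numpy sketch of that information flow, with all names, strides, and dimensions chosen for illustration only:

```python
import numpy as np

def bottleneck(content, stride=16):
    # Temporal downsampling as a stand-in for AutoVC's narrow code:
    # keeping only every `stride`-th frame limits channel capacity so
    # speaker identity cannot ride along with the content.
    return content[::stride]

def decoder_input(code, speaker_emb, T, stride=16):
    # Upsample the code back to T frames and append the *target*
    # speaker's embedding at every frame; swapping this embedding is
    # what performs the voice conversion.
    idx = np.minimum(np.arange(T) // stride, len(code) - 1)
    up = code[idx]                                # (T, d_code)
    spk = np.tile(speaker_emb, (T, 1))            # (T, d_spk)
    return np.concatenate([up, spk], axis=1)      # (T, d_code + d_spk)

content = np.random.randn(128, 64)    # 128 frames of encoder output
target_spk = np.random.randn(256)     # embedding of the target speaker
code = bottleneck(content)            # (8, 64)
dec_in = decoder_input(code, target_spk, T=128)
print(code.shape, dec_in.shape)       # (8, 64) (128, 320)
```

In the actual model the bottleneck is a learned dimension/rate reduction and the decoder is a neural network; the sketch only shows why the reconstruction objective alone suffices once capacity is constrained.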
FiLM: Visual Reasoning with a General Conditioning Layer, Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, Aaron Courville, 2018. Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32(1) (Association for the Advancement of Artificial Intelligence). DOI: 10.1609/aaai.v32i1.11671 - Introduces Feature-wise Linear Modulation (FiLM), a general conditioning method widely applied in neural networks, including its use for integrating speaker embeddings in advanced speech synthesis models.
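The FiLM operation itself is a per-channel affine transform, FiLM(x) = γ·x + β, where γ and β are predicted from the conditioning input (e.g. a speaker embedding). A minimal numpy sketch; in practice γ and β come from a small learned network, fixed here for clarity.

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise linear modulation (Perez et al., 2018):
    scale and shift each feature channel by conditioning-derived
    parameters, broadcast across positions."""
    return gamma * features + beta

x = np.ones((4, 3))                  # 4 positions, 3 channels
gamma = np.array([2.0, 0.5, 1.0])    # would be predicted from the conditioning input
beta = np.array([0.0, 1.0, -1.0])
y = film(x, gamma, beta)
print(y[0])  # [2.  1.5 0. ]
```

Because modulation is per-channel rather than per-position, FiLM adds very few parameters yet lets the conditioning signal reshape activations throughout the network, which is why it transfers so readily to speaker conditioning in TTS.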