Decoding Algorithms Comparison

References

Speech Recognition with Weighted Finite-State Transducers, Mehryar Mohri, Fernando Pereira, Michael Riley, 2002. Proceedings of the 2002 IEEE Workshop on Machine Learning for Signal Processing (IEEE). DOI: 10.1109/MLSP.2002.1026040 - Foundational paper detailing the use of Weighted Finite-State Transducers for building and decoding speech recognition systems.
Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks, Alex Graves, Santiago Fernández, Faustino Gomez, Jürgen Schmidhuber, 2006. Proceedings of the 23rd International Conference on Machine Learning (ICML) (Association for Computing Machinery). DOI: 10.1145/1143844.1143891 - Introduces the Connectionist Temporal Classification (CTC) loss function and its associated decoding algorithms, essential for modern end-to-end ASR.
Sequence Transduction with Recurrent Neural Networks, Alex Graves, 2012. Proceedings of the International Conference on Machine Learning (ICML) 2012 Workshop on Representation Learning, Vol. 27 - Presents the Recurrent Neural Network Transducer (RNN-T) architecture, a framework for mapping input sequences directly to output sequences without explicit alignments, which is crucial for streaming ASR.
Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition, William Chan, Navdeep Jaitly, Quoc V. Le, Oriol Vinyals, 2016. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE). DOI: 10.1109/ICASSP.2016.7472651 - Seminal paper introducing an end-to-end attention-based encoder-decoder model (LAS) for ASR, demonstrating the effectiveness of sequence-to-sequence learning with attention.
Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Daniel Jurafsky, James H. Martin, 2025 (Stanford University) - A comprehensive textbook covering fundamental concepts in speech recognition, including acoustic modeling, language modeling, and decoding algorithms such as Viterbi and beam search.

© 2025 ApX Machine Learning