Comparing MFCCs and Spectrograms as Input Features

References

Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences, S. B. Davis and P. Mermelstein, 1980 IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 28 (IEEE) DOI: 10.1109/TASSP.1980.1163420 - Foundational paper introducing Mel Frequency Cepstral Coefficients (MFCCs), explaining their derivation and utility in speech recognition.
Fundamentals of Speech Recognition, Lawrence R. Rabiner, Biing-Hwang Juang, 1993 (Prentice Hall) - A classic textbook providing comprehensive coverage of traditional speech recognition techniques, including the detailed theory and application of MFCCs with GMM-HMM systems.
Convolutional Neural Networks for Speech Recognition, Osama Abdel-Hamid, Abdel-Rahman Mohamed, Hui Jiang, Li Deng, George Penn, and Dong Yu, 2014 IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 22 (IEEE) DOI: 10.1109/TASLP.2014.2339736 - A seminal paper demonstrating the effectiveness of Convolutional Neural Networks (CNNs) in ASR; the CNNs take log-mel filter-bank features as input and learn from local patterns along the time and frequency axes.
Speech and Language Processing (3rd ed. draft), Daniel Jurafsky and James H. Martin, 2025 - An authoritative, continuously updated online textbook covering modern speech and language processing, including discussions on feature extraction (MFCCs and spectrograms) for deep learning-based ASR.

© 2025 ApX Machine Learning