Theory provides the blueprint, but implementation is where the system takes shape. To apply the principles of feature extraction covered so far, including MFCCs and log-mel spectrograms, we will develop a complete Python script that processes an entire audio dataset, converting a directory of raw audio files into a collection of normalized feature matrices. These matrices will serve as the direct input for the deep learning models in subsequent chapters. We will structure the code to be reusable and efficient, handling everything from loading individual files to calculating dataset-wide statistics for normalization.

### Setting Up the Workspace

Before we begin, we need a consistent workflow. We'll assume you have a dataset organized with audio files (e.g., in .wav format) and a metadata file that links each audio file to its transcript. A common format for this is a CSV file.

Let's start by importing the necessary libraries. We'll use librosa for audio processing, numpy for numerical operations, pandas to handle our metadata, and tqdm to give us a helpful progress bar when processing many files.

```python
import os

import librosa
import numpy as np
import pandas as pd
from tqdm import tqdm

# Configuration parameters for feature extraction
SAMPLE_RATE = 16000
N_FFT = 400        # Frame size: 25 ms at 16 kHz (16000 * 0.025)
HOP_LENGTH = 160   # Stride: 10 ms at 16 kHz (16000 * 0.010)
N_MELS = 80        # Number of Mel filter banks
```

We define our processing parameters upfront. Using constants like these makes the code cleaner and easier to modify later. A SAMPLE_RATE of 16000 Hz is standard for speech recognition, and the Fast Fourier Transform (FFT) window size (N_FFT) and stride (HOP_LENGTH) are set to correspond to 25 ms and 10 ms respectively, which are common values in ASR.

### The Data Processing Workflow

Our task can be broken down into a clear sequence of steps. We will first write a function to handle a single file, then build a loop to process the entire dataset, calculate normalization statistics, and finally save the prepared features.

The following diagram outlines the entire pipeline we are about to build.

```dot
digraph G {
    rankdir=TB;
    graph [bgcolor=transparent, fontname="Inter"];
    node [shape=box, style="rounded,filled", fillcolor="#e9ecef", fontname="Inter", margin="0.2,0.1"];
    edge [fontname="Inter"];

    subgraph cluster_0 {
        label = "Dataset Processing Script";
        bgcolor="#f8f9fa";
        style="rounded";
        start [label="Audio Dataset\n(WAV files + metadata.csv)", fillcolor="#d0bfff"];
        loop [label="For each audio file in metadata:", shape=ellipse, style=filled, fillcolor="#ffec99"];
        load_audio [label="Load Audio\n(librosa.load)", fillcolor="#a5d8ff"];
        compute_features [label="Compute Log-Mel Spectrogram\n(librosa.feature.melspectrogram)", fillcolor="#a5d8ff"];
        collect [label="Collect all feature matrices\nin a list", shape=diamond, style=filled, fillcolor="#ffc9c9"];

        start -> loop;
        loop -> load_audio [label=" 1. "];
        load_audio -> compute_features [label=" 2. "];
        compute_features -> loop [label=" 3. Repeat "];
        loop -> collect [label=" 4. Done "];
    }

    subgraph cluster_1 {
        label="Normalization and Saving";
        bgcolor="#f8f9fa";
        style="rounded";
        compute_stats [label="Calculate Global Mean & Std\n(np.mean, np.std)", fillcolor="#96f2d7"];
        normalize_loop [label="For each feature matrix:", shape=ellipse, style=filled, fillcolor="#ffec99"];
        apply_norm [label="Apply Normalization\nfeature = (feature - mean) / std", fillcolor="#b2f2bb"];
        save_features [label="Save Normalized Feature (.npy)\nSave Stats (mean, std) (.npz)", fillcolor="#ffd8a8"];
        end [label="Prepared Dataset Ready for Training", fillcolor="#d0bfff"];

        normalize_loop -> apply_norm;
        apply_norm -> save_features;
        save_features -> end;
    }

    collect -> compute_stats;
    compute_stats -> normalize_loop;
}
```

This workflow ensures that every audio file is processed identically and that normalization is applied based on statistics from the entire training set.

### Step 1: Processing a Single Audio File

Let's encapsulate the feature extraction logic for a single file into a function. This function takes a file path, loads the audio, and computes the log-mel spectrogram. As discussed in the previous section, log-mel spectrograms often perform better than MFCCs in modern end-to-end deep learning models, so we will use them in our practical implementation.

```python
def extract_log_mel_spectrogram(audio_path):
    """
    Loads an audio file and computes its log-mel spectrogram.

    Args:
        audio_path (str): Path to the audio file.

    Returns:
        numpy.ndarray: The log-mel spectrogram of the audio file,
        or None if the file could not be loaded.
    """
    # 1. Load the audio file, resampling to our target sample rate
    try:
        wav, sr = librosa.load(audio_path, sr=SAMPLE_RATE)
    except Exception as e:
        print(f"Error loading {audio_path}: {e}")
        return None

    # 2. Compute the mel spectrogram
    mel_spectrogram = librosa.feature.melspectrogram(
        y=wav,
        sr=SAMPLE_RATE,
        n_fft=N_FFT,
        hop_length=HOP_LENGTH,
        n_mels=N_MELS
    )

    # 3. Convert to log scale (decibels)
    log_mel_spectrogram = librosa.power_to_db(mel_spectrogram, ref=np.max)

    return log_mel_spectrogram
```

This function is the core of our processing pipeline. It loads an audio file, ensuring it is resampled to our target SAMPLE_RATE, computes the mel spectrogram with our defined parameters, and finally converts the power values to the logarithmic decibel (dB) scale. Using ref=np.max helps stabilize the conversion by scaling relative to the loudest part of the signal.
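Before scaling up to the full dataset, it's worth sanity-checking the function on a single file. A minimal check is sketched below; the file name is a placeholder, so substitute any WAV file from your own dataset.

```python
# Quick sanity check on one file; the path below is a placeholder,
# so substitute any WAV file from your own dataset.
example_path = os.path.join('data/wavs', 'sample_0001.wav')

features = extract_log_mel_spectrogram(example_path)
if features is not None:
    # Shape is (N_MELS, n_frames): 80 Mel bins by roughly one frame per 10 ms
    print(f"Feature shape: {features.shape}")
    print(f"Value range: {features.min():.1f} dB to {features.max():.1f} dB")
```

Because we used ref=np.max, the values should top out at 0 dB, with quieter time-frequency bins falling below it.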
### Step 2: Processing the Entire Dataset

With our single-file function ready, we can now iterate through the entire dataset. We will read a metadata.csv file, which we assume contains at least a file_path column. For each file, we call extract_log_mel_spectrogram and store the resulting features. We also keep a parallel list of the files that were processed successfully, so the feature matrices stay aligned with their source files even if some files fail to load.

```python
# Assume metadata.csv and audio files are in a 'data' directory
metadata_path = 'data/metadata.csv'
audio_dir = 'data/wavs'
features_output_dir = 'data/features'

# Create the output directory if it doesn't exist
os.makedirs(features_output_dir, exist_ok=True)

# Load metadata
metadata = pd.read_csv(metadata_path)

# --- Part 1: Extract all features to calculate global stats ---
all_features = []
processed_files = []  # Keeps features aligned with their source files

print("Extracting features for statistics calculation...")
for index, row in tqdm(metadata.iterrows(), total=len(metadata)):
    file_path = os.path.join(audio_dir, row['file_path'])
    features = extract_log_mel_spectrogram(file_path)
    if features is not None:
        all_features.append(features)
        processed_files.append(row['file_path'])
```

In this snippet, we first set up our directories and load the metadata. Then we loop through each file, extract its features, append the resulting NumPy array to the all_features list, and record the corresponding file path in processed_files. The tqdm library provides a clean and informative progress bar, which is very useful for long-running processes.

### Step 3: Calculating and Applying Normalization

Now that all_features contains the spectrograms from our entire dataset, we can compute the global mean and standard deviation for mean and variance normalization (the same idea as Cepstral Mean and Variance Normalization, CMVN, applied here to log-mel features). This step is important for stabilizing model training: by normalizing the features to have a mean of 0 and a standard deviation of 1, we ensure the network receives input in a consistent and predictable range.

**Note on memory:** For very large datasets, loading all features into memory might not be feasible. In such cases, you could use a memory-mapped array or compute the mean and standard deviation iteratively. For most moderately sized datasets, the in-memory approach is sufficient and simpler to implement.
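To make the iterative alternative mentioned in the note concrete, here is a rough sketch that accumulates per-bin sums and sums of squares instead of holding every spectrogram in memory. It is not part of the main script; we continue with the in-memory approach below.

```python
# Sketch of a streaming alternative: accumulate running sums per Mel bin
# instead of keeping every spectrogram in memory (not part of the main script).
total_frames = 0
feat_sum = np.zeros((N_MELS, 1))
feat_sq_sum = np.zeros((N_MELS, 1))

for _, row in tqdm(metadata.iterrows(), total=len(metadata)):
    features = extract_log_mel_spectrogram(os.path.join(audio_dir, row['file_path']))
    if features is None:
        continue
    total_frames += features.shape[1]
    feat_sum += features.sum(axis=1, keepdims=True)
    feat_sq_sum += (features ** 2).sum(axis=1, keepdims=True)

streaming_mean = feat_sum / total_frames
# Var(X) = E[X^2] - E[X]^2; clip tiny negatives caused by floating-point error
streaming_var = np.maximum(feat_sq_sum / total_frames - streaming_mean ** 2, 0.0)
streaming_std = np.sqrt(streaming_var)
```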
```python
# --- Part 2: Calculate global mean and standard deviation ---
# Concatenate all features along the time axis (axis=1).
# The matrices have variable lengths, so we concatenate first.
concatenated_features = np.concatenate(all_features, axis=1)

# Calculate mean and std deviation across all time steps for each mel bin
global_mean = np.mean(concatenated_features, axis=1, keepdims=True)
global_std = np.std(concatenated_features, axis=1, keepdims=True)

# Save these statistics for later use (e.g., during inference)
stats_file = os.path.join(features_output_dir, 'normalization_stats.npz')
np.savez(stats_file, mean=global_mean, std=global_std)

print(f"\nNormalization stats saved to {stats_file}")
print(f"Global Mean shape: {global_mean.shape}")
print(f"Global Std shape: {global_std.shape}")
```

Here, we first concatenate all feature matrices into a single large matrix. The axis=1 argument is important: it concatenates along the time dimension, keeping the Mel frequency bins separate. We then compute the mean and standard deviation for each of the 80 Mel bins across all time steps in the entire dataset. The keepdims=True argument ensures the output shape is (80, 1), which allows for easy broadcasting during the normalization step. Finally, we save these statistics with np.savez.
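The saved statistics can be reloaded whenever new audio has to be normalized in exactly the same way, for example at inference time. A minimal sketch, reusing the paths and function defined above (the utterance file name is a placeholder):

```python
# Sketch: reload the saved statistics and normalize a new utterance,
# e.g. at inference time. The file name below is a placeholder.
stats = np.load(os.path.join(features_output_dir, 'normalization_stats.npz'))
mean, std = stats['mean'], stats['std']

new_features = extract_log_mel_spectrogram('data/wavs/new_utterance.wav')
if new_features is not None:
    new_features = (new_features - mean) / (std + 1e-8)
```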
### Step 4: Saving the Normalized Features

With the global statistics computed, the final step is to apply the normalization to each feature matrix and save it to disk. We will save each normalized spectrogram as a separate .npy file. This one-to-one mapping between audio files and feature files makes it easy to load specific items during model training.

```python
# --- Part 3: Apply normalization and save individual features ---
print("\nApplying normalization and saving features...")

for rel_path, features in tqdm(zip(processed_files, all_features), total=len(all_features)):
    # Apply normalization (epsilon avoids division by zero)
    normalized_features = (features - global_mean) / (global_std + 1e-8)

    # Derive the .npy filename from the original audio filename
    filename_without_ext = os.path.splitext(os.path.basename(rel_path))[0]

    # Save the normalized feature matrix
    output_path = os.path.join(features_output_dir, f"{filename_without_ext}.npy")
    np.save(output_path, normalized_features)

print("\nFeature extraction and normalization complete.")
print(f"All normalized features saved in: {features_output_dir}")
```

In this final loop, we iterate through the features we extracted earlier together with their file paths. For each one, we apply the normalization formula $X_{\text{norm}} = (X - \mu) / (\sigma + \epsilon)$, adding a small epsilon ($10^{-8}$) to the standard deviation to prevent any potential division-by-zero errors. The resulting normalized matrix is then saved as a .npy file, a compact and efficient format for storing NumPy arrays.

By the end of this script, you will have a new directory filled with preprocessed, normalized features. This dataset is now perfectly formatted to be fed into the acoustic models we will begin building in the next chapter.
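As a preview of how these files will be consumed, here is a minimal sketch of loading a single prepared example. It assumes the metadata also contains a `transcript` column, which is an assumption about your CSV; adjust the column name to match your data.

```python
# Sketch: load one prepared example for training. The 'transcript' column
# name is an assumption; adjust it to match your metadata.csv.
def load_example(row):
    base = os.path.splitext(os.path.basename(row['file_path']))[0]
    features = np.load(os.path.join(features_output_dir, f"{base}.npy"))
    return features, row['transcript']

features, transcript = load_example(metadata.iloc[0])
print(features.shape, transcript)
```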