Theory provides the blueprint, but implementation is where the system takes shape. To apply the principles of feature extraction covered so far, including MFCCs and log-mel spectrograms, we will develop a complete Python script that processes an entire audio dataset, converting a directory of raw audio files into a collection of normalized feature matrices. These matrices will serve as the direct input for the deep learning models in subsequent chapters. We will structure the code to be reusable and efficient, handling everything from loading individual files to calculating dataset-wide statistics for normalization.

### Setting Up the Workspace

Before we begin, we need a consistent workflow. We'll assume you have a dataset organized with audio files (e.g., in .wav format) and a metadata file that links each audio file to its transcript. A common format for this is a CSV file.

Let's start by importing the necessary libraries. We'll use librosa for audio processing, numpy for numerical operations, pandas to handle our metadata, and tqdm to give us a helpful progress bar when processing many files.

```python
import os

import librosa
import numpy as np
import pandas as pd
from tqdm import tqdm

# Configuration parameters for feature extraction
SAMPLE_RATE = 16000
N_FFT = 400        # Frame size: 25 ms at 16 kHz (16000 * 0.025)
HOP_LENGTH = 160   # Stride: 10 ms at 16 kHz (16000 * 0.010)
N_MELS = 80        # Number of Mel filter banks
```

We define our processing parameters upfront. Using constants like these makes the code cleaner and easier to modify later. A SAMPLE_RATE of 16000 Hz is standard for speech recognition, and the Fast Fourier Transform (FFT) window size (N_FFT) and stride (HOP_LENGTH) are set to correspond to 25 ms and 10 ms respectively, which are common values in ASR.

### The Data Processing Workflow

Our task can be broken down into a clear sequence of steps. We will first write a function to handle a single file, then build a loop to process the entire dataset, calculate normalization statistics, and finally save the prepared features.

The following diagram outlines the entire pipeline we are about to build.

```dot
digraph G {
    rankdir=TB;
    graph [bgcolor=transparent, fontname="Inter"];
    node [shape=box, style="rounded,filled", fillcolor="#e9ecef", fontname="Inter", margin="0.2,0.1"];
    edge [fontname="Inter"];

    subgraph cluster_0 {
        label = "Dataset Processing Script";
        bgcolor="#f8f9fa";
        style="rounded";
        start [label="Audio Dataset\n(WAV files + metadata.csv)", fillcolor="#d0bfff"];
        loop [label="For each audio file in metadata:", shape=ellipse, style=filled, fillcolor="#ffec99"];
        load_audio [label="Load Audio\n(librosa.load)", fillcolor="#a5d8ff"];
        compute_features [label="Compute Log-Mel Spectrogram\n(librosa.feature.melspectrogram)", fillcolor="#a5d8ff"];
        collect [label="Collect all feature matrices\nin a list", shape=diamond, style=filled, fillcolor="#ffc9c9"];

        start -> loop;
        loop -> load_audio [label=" 1. "];
        load_audio -> compute_features [label=" 2. "];
        compute_features -> loop [label=" 3. Repeat "];
        loop -> collect [label=" 4. Done "];
    }

    subgraph cluster_1 {
        label="Normalization and Saving";
        bgcolor="#f8f9fa";
        style="rounded";
        compute_stats [label="Calculate Global Mean & Std\n(np.mean, np.std)", fillcolor="#96f2d7"];
        normalize_loop [label="For each feature matrix:", shape=ellipse, style=filled, fillcolor="#ffec99"];
        apply_norm [label="Apply Normalization\nfeature = (feature - mean) / std", fillcolor="#b2f2bb"];
        save_features [label="Save Normalized Feature (.npy)\nSave Stats (mean, std) (.npz)", fillcolor="#ffd8a8"];
        end [label="Prepared Dataset Ready for Training", fillcolor="#d0bfff"];

        normalize_loop -> apply_norm;
        apply_norm -> save_features;
        save_features -> end;
    }

    collect -> compute_stats;
    compute_stats -> normalize_loop;
}
```

This workflow ensures that every audio file is processed identically and that normalization is applied based on statistics from the entire training set.

### Step 1: Processing a Single Audio File

Let's encapsulate the feature extraction logic for a single file into a function. This function takes a file path, loads the audio, and computes the log-mel spectrogram. As discussed in the previous section, log-mel spectrograms often perform better than MFCCs in modern end-to-end deep learning models, so we will use them in our practical implementation.

```python
def extract_log_mel_spectrogram(audio_path):
    """
    Loads an audio file and computes its log-mel spectrogram.

    Args:
        audio_path (str): Path to the audio file.

    Returns:
        numpy.ndarray: The log-mel spectrogram of the audio file,
        or None if the file could not be loaded.
    """
    # 1. Load the audio file, resampling to our target sample rate
    try:
        wav, sr = librosa.load(audio_path, sr=SAMPLE_RATE)
    except Exception as e:
        print(f"Error loading {audio_path}: {e}")
        return None

    # 2. Compute the mel spectrogram
    mel_spectrogram = librosa.feature.melspectrogram(
        y=wav,
        sr=SAMPLE_RATE,
        n_fft=N_FFT,
        hop_length=HOP_LENGTH,
        n_mels=N_MELS
    )

    # 3. Convert to log scale (decibels)
    log_mel_spectrogram = librosa.power_to_db(mel_spectrogram, ref=np.max)

    return log_mel_spectrogram
```

This function is the core of our processing pipeline. It loads an audio file, ensuring it is resampled to our target SAMPLE_RATE, computes the mel spectrogram with our defined parameters, and finally converts the power values to the logarithmic decibel (dB) scale. Using ref=np.max helps stabilize the conversion by scaling relative to the loudest part of the signal.
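Before scaling up to the full dataset, it's worth sanity-checking the function on a single file. A minimal check is sketched below; the file name is a placeholder, so substitute any WAV file from your own dataset.

```python
# Quick sanity check on one file; the path below is a placeholder,
# so substitute any WAV file from your own dataset.
example_path = os.path.join('data/wavs', 'sample_0001.wav')

features = extract_log_mel_spectrogram(example_path)
if features is not None:
    # Shape is (N_MELS, n_frames): 80 Mel bins by roughly one frame per 10 ms
    print(f"Feature shape: {features.shape}")
    print(f"Value range: {features.min():.1f} dB to {features.max():.1f} dB")
```

Because we used ref=np.max, the values should top out at 0 dB, with quieter time-frequency bins falling below it.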
### Step 2: Processing the Entire Dataset

With our single-file function ready, we can now iterate through the entire dataset. We will read a metadata.csv file, which we assume contains at least a file_path column. For each file, we call extract_log_mel_spectrogram and store the resulting features. We also keep a parallel list of the files that were processed successfully, so the feature matrices stay aligned with their source files even if some files fail to load.

```python
# Assume metadata.csv and audio files are in a 'data' directory
metadata_path = 'data/metadata.csv'
audio_dir = 'data/wavs'
features_output_dir = 'data/features'

# Create the output directory if it doesn't exist
os.makedirs(features_output_dir, exist_ok=True)

# Load metadata
metadata = pd.read_csv(metadata_path)

# --- Part 1: Extract all features to calculate global stats ---
all_features = []
processed_files = []  # Keeps features aligned with their source files

print("Extracting features for statistics calculation...")
for index, row in tqdm(metadata.iterrows(), total=len(metadata)):
    file_path = os.path.join(audio_dir, row['file_path'])
    features = extract_log_mel_spectrogram(file_path)
    if features is not None:
        all_features.append(features)
        processed_files.append(row['file_path'])
```

In this snippet, we first set up our directories and load the metadata. Then we loop through each file, extract its features, append the resulting NumPy array to the all_features list, and record the corresponding file path in processed_files. The tqdm library provides a clean and informative progress bar, which is very useful for long-running processes.

### Step 3: Calculating and Applying Normalization

Now that all_features contains the spectrograms from our entire dataset, we can compute the global mean and standard deviation for mean and variance normalization (the same idea as Cepstral Mean and Variance Normalization, CMVN, applied here to log-mel features). This step is important for stabilizing model training: by normalizing the features to have a mean of 0 and a standard deviation of 1, we ensure the network receives input in a consistent and predictable range.

**Note on memory:** For very large datasets, loading all features into memory might not be feasible. In such cases, you could use a memory-mapped array or compute the mean and standard deviation iteratively. For most moderately sized datasets, the in-memory approach is sufficient and simpler to implement.
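To make the iterative alternative mentioned in the note concrete, here is a rough sketch that accumulates per-bin sums and sums of squares instead of holding every spectrogram in memory. It is not part of the main script; we continue with the in-memory approach below.

```python
# Sketch of a streaming alternative: accumulate running sums per Mel bin
# instead of keeping every spectrogram in memory (not part of the main script).
total_frames = 0
feat_sum = np.zeros((N_MELS, 1))
feat_sq_sum = np.zeros((N_MELS, 1))

for _, row in tqdm(metadata.iterrows(), total=len(metadata)):
    features = extract_log_mel_spectrogram(os.path.join(audio_dir, row['file_path']))
    if features is None:
        continue
    total_frames += features.shape[1]
    feat_sum += features.sum(axis=1, keepdims=True)
    feat_sq_sum += (features ** 2).sum(axis=1, keepdims=True)

streaming_mean = feat_sum / total_frames
# Var(X) = E[X^2] - E[X]^2; clip tiny negatives caused by floating-point error
streaming_var = np.maximum(feat_sq_sum / total_frames - streaming_mean ** 2, 0.0)
streaming_std = np.sqrt(streaming_var)
```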
```python
# --- Part 2: Calculate global mean and standard deviation ---
# Concatenate all features along the time axis (axis=1).
# The matrices have variable lengths, so we concatenate first.
concatenated_features = np.concatenate(all_features, axis=1)

# Calculate mean and std deviation across all time steps for each mel bin
global_mean = np.mean(concatenated_features, axis=1, keepdims=True)
global_std = np.std(concatenated_features, axis=1, keepdims=True)

# Save these statistics for later use (e.g., during inference)
stats_file = os.path.join(features_output_dir, 'normalization_stats.npz')
np.savez(stats_file, mean=global_mean, std=global_std)

print(f"\nNormalization stats saved to {stats_file}")
print(f"Global Mean shape: {global_mean.shape}")
print(f"Global Std shape: {global_std.shape}")
```

Here, we first concatenate all feature matrices into a single large matrix. The axis=1 argument is important: it concatenates along the time dimension, keeping the Mel frequency bins separate. We then compute the mean and standard deviation for each of the 80 Mel bins across all time steps in the entire dataset. The keepdims=True argument ensures the output shape is (80, 1), which allows for easy broadcasting during the normalization step. Finally, we save these statistics with np.savez.
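The saved statistics can be reloaded whenever new audio has to be normalized in exactly the same way, for example at inference time. A minimal sketch, reusing the paths and function defined above (the utterance file name is a placeholder):

```python
# Sketch: reload the saved statistics and normalize a new utterance,
# e.g. at inference time. The file name below is a placeholder.
stats = np.load(os.path.join(features_output_dir, 'normalization_stats.npz'))
mean, std = stats['mean'], stats['std']

new_features = extract_log_mel_spectrogram('data/wavs/new_utterance.wav')
if new_features is not None:
    new_features = (new_features - mean) / (std + 1e-8)
```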
### Step 4: Saving the Normalized Features

With the global statistics computed, the final step is to apply the normalization to each feature matrix and save it to disk. We will save each normalized spectrogram as a separate .npy file. This one-to-one mapping between audio files and feature files makes it easy to load specific items during model training.

```python
# --- Part 3: Apply normalization and save individual features ---
print("\nApplying normalization and saving features...")

for rel_path, features in tqdm(zip(processed_files, all_features), total=len(all_features)):
    # Apply normalization (epsilon avoids division by zero)
    normalized_features = (features - global_mean) / (global_std + 1e-8)

    # Derive the .npy filename from the original audio filename
    filename_without_ext = os.path.splitext(os.path.basename(rel_path))[0]

    # Save the normalized feature matrix
    output_path = os.path.join(features_output_dir, f"{filename_without_ext}.npy")
    np.save(output_path, normalized_features)

print("\nFeature extraction and normalization complete.")
print(f"All normalized features saved in: {features_output_dir}")
```

In this final loop, we iterate through the features we extracted earlier together with their file paths. For each one, we apply the normalization formula $X_{\text{norm}} = (X - \mu) / (\sigma + \epsilon)$, adding a small epsilon ($10^{-8}$) to the standard deviation to prevent any potential division-by-zero errors. The resulting normalized matrix is then saved as a .npy file, a compact and efficient format for storing NumPy arrays.

By the end of this script, you will have a new directory filled with preprocessed, normalized features. This dataset is now perfectly formatted to be fed into the acoustic models we will begin building in the next chapter.
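As a preview of how these files will be consumed, here is a minimal sketch of loading a single prepared example. It assumes the metadata also contains a `transcript` column, which is an assumption about your CSV; adjust the column name to match your data.

```python
# Sketch: load one prepared example for training. The 'transcript' column
# name is an assumption; adjust it to match your metadata.csv.
def load_example(row):
    base = os.path.splitext(os.path.basename(row['file_path']))[0]
    features = np.load(os.path.join(features_output_dir, f"{base}.npy"))
    return features, row['transcript']

features, transcript = load_example(metadata.iloc[0])
print(features.shape, transcript)
```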