Neural vocoder architectures, from autoregressive models like WaveNet to efficient GAN-based approaches such as HiFi-GAN, transform the intermediate acoustic features (usually mel-spectrograms) produced by Text-to-Speech (TTS) systems into final, audible waveforms. This practical guide walks through using a pre-trained neural vocoder to perform that synthesis step: we take existing mel-spectrograms and convert them into audio, simulating the final stage of a modern TTS pipeline.

For this exercise, we'll use the TTS library (Coqui TTS), a popular open-source toolkit, and a pre-trained HiFi-GAN vocoder model. HiFi-GAN is known for its high fidelity and computational efficiency, making it a common choice in many TTS systems.

### Environment Setup

First, ensure you have the necessary libraries installed. We primarily need TTS, which pulls in torch and its other dependencies. You will also need soundfile (and numpy) for saving the audio.

```bash
# Install the Coqui TTS library
pip install TTS

# Install soundfile and numpy if you don't have them
pip install soundfile numpy
```

You'll also need a pre-computed mel-spectrogram file as input for the vocoder. For this example, let's assume you have a NumPy file named sample_mel_spectrogram.npy. This file would typically be the output of an acoustic model (such as Tacotron 2 or FastSpeech 2) from the preceding TTS stage, representing the acoustic features for a specific utterance.

Note: Generating this mel-spectrogram file involves running a separate TTS acoustic model, which was covered in Chapter 4. For this exercise, focus on the vocoder's role and assume the mel-spectrogram is already available. (If you don't have such a file to hand, a short sketch of one way to create a test input appears after the vocoder-loading code below.)

### Loading the Pre-trained Vocoder

The TTS library provides a convenient interface for loading various pre-trained models. We will load a HiFi-GAN model trained on the LJSpeech dataset.

```python
import torch

from TTS.utils.manage import ModelManager
from TTS.utils.synthesizer import Synthesizer

# The ModelManager handles downloading and caching pre-trained models
# (by default they are stored under ~/.local/share/tts/ on Linux).
manager = ModelManager()

# List available vocoder models (optional, for exploration)
# print(manager.list_models())

# Download a pre-trained HiFi-GAN vocoder trained on LJSpeech
vocoder_model_name = "vocoder_models/en/ljspeech/hifigan_v2"
# Other pre-trained vocoders are also available, e.g. a universal WaveGrad model:
# vocoder_model_name = "vocoder_models/universal/libri-tts/wavegrad"

try:
    vocoder_path, vocoder_config_path, _ = manager.download_model(vocoder_model_name)
except ValueError as e:
    print(f"Error downloading model: {e}")
    print("Please check the model name or your internet connection.")
    # Guidance on finding correct model names
    print("You can list available models using manager.list_models()")
    exit()  # Exit if the model download fails

# Check whether CUDA (GPU) is available, otherwise use the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Initialize the Synthesizer with only the vocoder.
# We don't need an acoustic (text-to-mel) model here since we provide the
# mel-spectrogram directly.
syn = Synthesizer(
    tts_checkpoint=None,    # No TTS model checkpoint
    tts_config_path=None,   # No TTS model config
    vocoder_checkpoint=vocoder_path,
    vocoder_config=vocoder_config_path,
    use_cuda=(device == "cuda"),
)

print("Neural vocoder model loaded successfully.")
# The loaded vocoder is available as syn.vocoder_model
# (in recent library versions, a wrapper around the HiFi-GAN generator network).
```

This code snippet initializes the ModelManager to handle model downloads and then uses the Synthesizer class, configured only with the vocoder model details. The library downloads the specified HiFi-GAN model if it is not already present locally. The snippet also checks whether a GPU is available so inference can run faster.
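As an aside, if you do not have an acoustic-model output to hand, a common way to obtain a test input for a vocoder is copy synthesis: extract a mel-spectrogram from a real recording and let the vocoder reconstruct it. The following is a minimal sketch using librosa rather than the TTS toolkit itself; the file name reference.wav is a placeholder, and the analysis parameters (22050 Hz, 80 mel bins, 1024-point FFT, hop length 256, log compression) are assumptions that only roughly match typical LJSpeech vocoder configurations, so check them against the downloaded vocoder config before relying on the result.

```python
# Minimal copy-synthesis sketch (assumes librosa is installed: pip install librosa).
# All analysis parameters below are assumptions; match them to your vocoder's config.
import librosa
import numpy as np

# Load a reference recording (placeholder file name) at an assumed 22050 Hz
wav, sr = librosa.load("reference.wav", sr=22050)

# Compute an 80-bin mel-spectrogram with parameters typical of LJSpeech vocoders
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, win_length=1024,
    n_mels=80, fmin=0, fmax=8000, power=1.0,
)

# Many neural vocoders are trained on log-mel features; the exact compression and
# normalization differ between checkpoints, so treat this as a starting point only.
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None)).astype(np.float32)

np.save("sample_mel_spectrogram.npy", log_mel)
print("Saved mel-spectrogram with shape:", log_mel.shape)  # (80, num_frames)
```

If audio reconstructed from such a file sounds muffled or distorted, a feature mismatch (scale, normalization, or hop length) is the most likely cause.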
### Loading the Acoustic Features (Mel-Spectrogram)

Now, load the pre-computed mel-spectrogram from the .npy file. This file should contain a 2D NumPy array in which one dimension holds the mel frequency bins and the other holds the time frames.

```python
import numpy as np

# Load the mel-spectrogram from a file.
# Replace 'sample_mel_spectrogram.npy' with the actual path to your file.
mel_file = 'sample_mel_spectrogram.npy'

try:
    mel_spectrogram = np.load(mel_file)
    print(f"Loaded mel-spectrogram from {mel_file}")
    print(f"Shape: {mel_spectrogram.shape}")  # Example shape: (80, 250) -> 80 mel bins, 250 frames
except FileNotFoundError:
    print(f"Error: Mel-spectrogram file not found at {mel_file}")
    print("Please ensure the file exists or provide the correct path.")
    # Create a dummy spectrogram so the rest of the practical still runs
    # (random values will not produce intelligible speech).
    print("Creating a dummy mel-spectrogram for demonstration.")
    mel_spectrogram = np.random.rand(80, 250).astype(np.float32)  # 80 mel bins, 250 frames
except Exception as e:
    print(f"Error loading mel-spectrogram: {e}")
    exit()

# The vocoder expects the mel-spectrogram as a torch tensor with a batch dimension.
# Shape expected by many vocoders: [batch_size, num_mels, num_frames]
mel_tensor = torch.tensor(mel_spectrogram).unsqueeze(0).to(device)
print(f"Converted mel-spectrogram to tensor with shape: {mel_tensor.shape}")
```

Here, we load the NumPy array and convert it into a PyTorch tensor. Crucially, we add a batch dimension (unsqueeze(0)), as most deep learning models expect batched input even when the batch size is one. We also move the tensor to the appropriate device (CPU or GPU).

### Generating the Waveform

With the vocoder loaded and the input mel-spectrogram prepared, we can run inference. The Synthesizer normally drives the whole pipeline through its tts() method, but since we are bypassing the text-to-mel stage, we call the underlying vocoder model's inference method directly.

```python
# Use the loaded vocoder to convert the mel-spectrogram into a waveform.
# Note: attribute and method names can vary slightly between TTS library versions;
# check the documentation for your installed version if this call fails.
print("Generating waveform from mel-spectrogram...")

# Pass the mel tensor (already on the correct device) to the vocoder's inference method
outputs = syn.vocoder_model.inference(mel_tensor)

# The output is a tensor of raw audio samples. It may live on the GPU, so move it
# to the CPU and convert it to a NumPy array. The shape is typically
# [batch_size, 1, num_samples] or [batch_size, num_samples].
waveform = outputs.squeeze().cpu().numpy()

print(f"Generated waveform with shape: {waveform.shape}")  # Example shape: (55125,) -> number of audio samples
print("Waveform generation complete.")
```

The inference method of the loaded vocoder model takes the mel-spectrogram tensor as input and outputs the corresponding audio waveform tensor. We then move this tensor back to the CPU and convert it to a NumPy array for easier handling and saving.
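Before saving, a quick sanity check can confirm the vocoder behaved as expected. HiFi-GAN upsamples each spectrogram frame by the hop length used during training, so the number of output samples should be close to num_frames × hop_length. The sketch below assumes a hop length of 256 samples and a 22050 Hz sample rate, which are typical of LJSpeech models; read the actual values from the downloaded vocoder config if yours differ.

```python
# Rough sanity checks on the generated waveform.
# hop_length = 256 is an assumed value; the true figure is in the vocoder's audio config.
hop_length = 256
num_frames = mel_spectrogram.shape[1]

expected_samples = num_frames * hop_length
print(f"Frames: {num_frames}, expected ~{expected_samples} samples, got {waveform.shape[0]}")

# Peak amplitude should sit comfortably below 1.0 for float audio; values at or above
# 1.0 suggest clipping or a mismatch between the input features and the vocoder.
print(f"Peak amplitude: {np.abs(waveform).max():.3f}")
print(f"Approximate duration: {waveform.shape[0] / 22050:.2f} s at an assumed 22050 Hz")
```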
### Saving and Listening

Finally, save the generated waveform as a standard audio file (such as WAV) and listen to it. You'll need the sample rate associated with the pre-trained vocoder model, which comes from the model's configuration.

```python
import soundfile as sf

# Get the sample rate the vocoder was trained with, so the audio plays back at the
# correct speed. Recent versions of the Synthesizer expose it as output_sample_rate
# (read from the vocoder config); fall back to 22050 Hz, the LJSpeech default, if
# the attribute is unavailable.
output_sample_rate = getattr(syn, "output_sample_rate", None) or 22050
print(f"Using sample rate: {output_sample_rate} Hz")

# Define the output file path
output_wav_file = 'generated_audio_hifigan.wav'

# Save the waveform as a WAV file
try:
    sf.write(output_wav_file, waveform, output_sample_rate)
    print(f"Audio saved successfully to {output_wav_file}")
except Exception as e:
    print(f"Error saving audio file: {e}")

print("\nPractical complete. You can now listen to the generated audio file.")
```

This code retrieves the sample rate associated with the vocoder (essential for correct playback speed) and uses the soundfile library to write the NumPy array of waveform samples to a .wav file.

Listen to the generated_audio_hifigan.wav file and compare its quality to output from traditional signal-processing approaches such as Griffin-Lim. Does it sound natural? Are there noticeable artifacts, such as buzzing or hissing? This hands-on experience directly demonstrates the quality improvements offered by modern neural vocoders like HiFi-GAN, which you learned about earlier in the chapter. You can experiment further by obtaining mel-spectrograms for different sentences, or by using other pre-trained vocoder models available through the toolkit (e.g., WaveGrad or MelGAN) and comparing their outputs.
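To make the Griffin-Lim comparison concrete, the sketch below reconstructs audio from the same mel-spectrogram with librosa's Griffin-Lim-based inversion. It assumes the log-mel convention and analysis parameters from the copy-synthesis sketch earlier (22050 Hz, 1024-point FFT, hop length 256, fmax 8000 Hz); if your mel-spectrogram came from an acoustic model with a different compression or normalization, invert that first or the comparison will not be meaningful.

```python
# Griffin-Lim baseline for comparison (assumes the log-mel convention from the
# copy-synthesis sketch above; adjust the inverse transform to match your features).
import librosa
import numpy as np
import soundfile as sf

# Undo the log compression (assumed convention) to get a linear-amplitude mel-spectrogram
mel_linear = np.exp(mel_spectrogram)

# Invert the mel-spectrogram to audio using librosa's Griffin-Lim implementation
gl_waveform = librosa.feature.inverse.mel_to_audio(
    mel_linear, sr=22050, n_fft=1024, hop_length=256, win_length=1024,
    power=1.0, fmin=0, fmax=8000, n_iter=60,
)

sf.write("generated_audio_griffinlim.wav", gl_waveform, 22050)
print("Saved Griffin-Lim baseline to generated_audio_griffinlim.wav")
```

Listening to the two files side by side typically makes the difference obvious: the Griffin-Lim version tends to sound metallic and phasey, while the HiFi-GAN output is much closer to natural speech.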