Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis, Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu, 2018. Advances in Neural Information Processing Systems 31 (NeurIPS 2018) (Neural Information Processing Systems Foundation, Inc.) - A foundational paper demonstrating zero-shot voice cloning by leveraging speaker embeddings from a pre-trained speaker verification model to condition a multi-speaker Tacotron-based TTS system.
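The conditioning mechanism described above can be sketched in a few lines: a fixed-dimensional speaker embedding (d-vector) from the verification model is broadcast-concatenated to every encoder timestep before attention. A minimal numpy sketch; the function name and dimensions are illustrative, not from the paper.

```python
import numpy as np

def condition_on_speaker(encoder_out, speaker_emb):
    """Tile a fixed speaker embedding across time and concatenate it
    to each encoder state (SV2TTS-style conditioning).
    encoder_out: (T, d_enc), speaker_emb: (d_spk,) -> (T, d_enc + d_spk)."""
    T = encoder_out.shape[0]
    tiled = np.tile(speaker_emb, (T, 1))                  # (T, d_spk)
    return np.concatenate([encoder_out, tiled], axis=1)   # (T, d_enc + d_spk)

enc = np.random.randn(50, 512)   # 50 timesteps of Tacotron encoder states (illustrative sizes)
spk = np.random.randn(256)       # d-vector from the speaker encoder
out = condition_on_speaker(enc, spk)
print(out.shape)  # (50, 768)
```

Because the embedding is produced by a separately trained verification network, an unseen speaker's voice can be cloned at inference time with no TTS fine-tuning.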
AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss, Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, Mark Hasegawa-Johnson, 2019. International Conference on Machine Learning, Vol. 97 (PMLR) - Presents AutoVC, a significant architecture for zero-shot voice conversion that disentangles speaker identity from linguistic content using a carefully sized autoencoder bottleneck trained with only a reconstruction loss.
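AutoVC's central idea is that a sufficiently narrow content bottleneck forces speaker information out of the content code, which the decoder then recovers from a separately supplied speaker embedding. A toy numpy sketch of that information flow, with all names, strides, and dimensions chosen for illustration only:

```python
import numpy as np

def bottleneck(content, stride=16):
    # Temporal downsampling as a stand-in for AutoVC's narrow code:
    # keeping only every `stride`-th frame limits channel capacity so
    # speaker identity cannot ride along with the content.
    return content[::stride]

def decoder_input(code, speaker_emb, T, stride=16):
    # Upsample the code back to T frames and append the *target*
    # speaker's embedding at every frame; swapping this embedding is
    # what performs the voice conversion.
    idx = np.minimum(np.arange(T) // stride, len(code) - 1)
    up = code[idx]                                # (T, d_code)
    spk = np.tile(speaker_emb, (T, 1))            # (T, d_spk)
    return np.concatenate([up, spk], axis=1)      # (T, d_code + d_spk)

content = np.random.randn(128, 64)    # 128 frames of encoder output
target_spk = np.random.randn(256)     # embedding of the target speaker
code = bottleneck(content)            # (8, 64)
dec_in = decoder_input(code, target_spk, T=128)
print(code.shape, dec_in.shape)       # (8, 64) (128, 320)
```

In the actual model the bottleneck is a learned dimension/rate reduction and the decoder is a neural network; the sketch only shows why the reconstruction objective alone suffices once capacity is constrained.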
FiLM: Visual Reasoning with a General Conditioning Layer, Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, Aaron Courville, 2018. Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32(1) (Association for the Advancement of Artificial Intelligence). DOI: 10.1609/aaai.v32i1.11671 - Introduces Feature-wise Linear Modulation (FiLM), a general conditioning method widely applied in neural networks, including its use for integrating speaker embeddings in advanced speech synthesis models.
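The FiLM operation itself is a per-channel affine transform, FiLM(x) = γ·x + β, where γ and β are predicted from the conditioning input (e.g. a speaker embedding). A minimal numpy sketch; in practice γ and β come from a small learned network, fixed here for clarity.

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise linear modulation (Perez et al., 2018):
    scale and shift each feature channel by conditioning-derived
    parameters, broadcast across positions."""
    return gamma * features + beta

x = np.ones((4, 3))                  # 4 positions, 3 channels
gamma = np.array([2.0, 0.5, 1.0])    # would be predicted from the conditioning input
beta = np.array([0.0, 1.0, -1.0])
y = film(x, gamma, beta)
print(y[0])  # [2.  1.5 0. ]
```

Because modulation is per-channel rather than per-position, FiLM adds very few parameters yet lets the conditioning signal reshape activations throughout the network, which is why it transfers so readily to speaker conditioning in TTS.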