Transfer Learning from Speaker Verification to Multispeaker Text-to-Speech Synthesis, Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu, 2018. Advances in Neural Information Processing Systems 31 (NeurIPS 2018) (Neural Information Processing Systems Foundation, Inc.) - Presents a multispeaker TTS method in which a speaker encoder, trained separately on a speaker verification task, extracts a fixed-dimensional embedding from reference audio. Conditioning the synthesizer on this embedding enables zero-shot synthesis in the voices of speakers unseen during training.
Expressive Neural Speech Synthesis by Learning a Disentangled Style Latent Space, Shengqiang Sun, Qinghua Zheng, Fanglei Sun, Yujia Li, Ying Qin, 2020. Proceedings of the 28th ACM International Conference on Multimedia (MM '20) (ACM). DOI: 10.1145/3394171.3413988 - Explores using Variational Autoencoders (VAEs) to learn a disentangled latent space for expressive speech synthesis, allowing for independent control over various stylistic aspects.