Li Yinghao Aaron, Han Cong, Mesgarani Nima
Department of Electrical Engineering, Columbia University, USA.
SLT Workshop Spok Lang Technol. 2023 Jan;2022:920-927. doi: 10.1109/slt54892.2023.10022498.
One-shot voice conversion (VC) aims to convert speech from any source speaker to an arbitrary target speaker using only a few seconds of reference speech from the target speaker. This relies heavily on disentangling the speaker's identity from the speech content, a task that remains challenging. Here, we propose a novel approach that learns disentangled speech representations by transfer learning from style-based text-to-speech (TTS) models. With cycle-consistent and adversarial training, style-based TTS models can perform transcription-guided one-shot VC with high fidelity and similarity. By learning an additional mel-spectrogram encoder through teacher-student knowledge transfer and a novel data augmentation scheme, our approach yields disentangled speech representations without requiring the input text. Subjective evaluations show that our approach significantly outperforms previous state-of-the-art one-shot voice conversion models in both naturalness and similarity.
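The teacher-student knowledge transfer described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the module name MelEncoder, the layer sizes, the L1 distillation loss, and the augmentation handling are all assumptions. The idea is that a frozen, text-conditioned encoder from the trained TTS model serves as the teacher, and a student mel-spectrogram encoder is trained to reproduce the teacher's content representation from audio alone, so that inference no longer needs the transcription.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 80-bin mel-spectrograms, 512-dim content latents.
class MelEncoder(nn.Module):
    """Student: maps a mel-spectrogram to the teacher's latent content space."""
    def __init__(self, n_mels: int = 80, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> (batch, hidden, frames)
        return self.net(mel)

def distillation_step(student: MelEncoder,
                      teacher_latent: torch.Tensor,
                      mel: torch.Tensor,
                      mel_augmented: torch.Tensor,
                      optimizer: torch.optim.Optimizer) -> float:
    """One teacher-student transfer step (assumed L1 matching loss).

    teacher_latent: content representation produced for this utterance by the
        frozen, text-conditioned TTS encoder (treated as the regression target).
    mel_augmented: a perturbed copy of `mel` (e.g., pitch/rate-shifted); pushing
        both views to the same teacher latent encourages the student to discard
        speaker identity and keep only the speech content.
    """
    optimizer.zero_grad()
    loss = (nn.functional.l1_loss(student(mel), teacher_latent)
            + nn.functional.l1_loss(student(mel_augmented), teacher_latent))
    loss.backward()
    optimizer.step()
    return loss.item()
```

At conversion time, under these assumptions, the student encoder replaces the text encoder: the source utterance's mel-spectrogram yields the content latent, which is decoded with the target speaker's style embedding extracted from the few-second reference.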