Li Yinghao Aaron, Han Cong, Mesgarani Nima
Department of Electrical Engineering, Columbia University, USA.
SLT Workshop Spok Lang Technol. 2023 Jan;2022:920-927. doi: 10.1109/slt54892.2023.10022498.
One-shot voice conversion (VC) aims to convert speech from any source speaker to an arbitrary target speaker using only a few seconds of reference speech from the target speaker. This relies heavily on disentangling the speaker's identity from the speech content, a task that remains challenging. Here, we propose a novel approach that learns disentangled speech representations by transfer learning from style-based text-to-speech (TTS) models. With cycle-consistent and adversarial training, style-based TTS models can perform transcription-guided one-shot VC with high fidelity and similarity. By learning an additional mel-spectrogram encoder through teacher-student knowledge transfer and a novel data augmentation scheme, our approach yields disentangled speech representations without requiring the input text. Subjective evaluations show that our approach significantly outperforms previous state-of-the-art one-shot voice conversion models in both naturalness and similarity.
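The teacher-student knowledge transfer described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the module name MelEncoder, the layer sizes, the L1 distillation loss, and the augmentation handling are all assumptions. The idea is that a frozen, text-conditioned encoder from the trained TTS model serves as the teacher, and a student mel-spectrogram encoder is trained to reproduce the teacher's content representation from audio alone, so that inference no longer needs the transcription.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 80-bin mel-spectrograms, 512-dim content latents.
class MelEncoder(nn.Module):
    """Student: maps a mel-spectrogram to the teacher's latent content space."""
    def __init__(self, n_mels: int = 80, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> (batch, hidden, frames)
        return self.net(mel)

def distillation_step(student: MelEncoder,
                      teacher_latent: torch.Tensor,
                      mel: torch.Tensor,
                      mel_augmented: torch.Tensor,
                      optimizer: torch.optim.Optimizer) -> float:
    """One teacher-student transfer step (assumed L1 matching loss).

    teacher_latent: content representation produced for this utterance by the
        frozen, text-conditioned TTS encoder (treated as the regression target).
    mel_augmented: a perturbed copy of `mel` (e.g., pitch/rate-shifted); pushing
        both views to the same teacher latent encourages the student to discard
        speaker identity and keep only the speech content.
    """
    optimizer.zero_grad()
    loss = (nn.functional.l1_loss(student(mel), teacher_latent)
            + nn.functional.l1_loss(student(mel_augmented), teacher_latent))
    loss.backward()
    optimizer.step()
    return loss.item()
```

At conversion time, under these assumptions, the student encoder replaces the text encoder: the source utterance's mel-spectrogram yields the content latent, which is decoded with the target speaker's style embedding extracted from the few-second reference.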