
CMDF-TTS: Text-to-speech method with limited target speaker corpus.

Authors

Tao Ye, Liu Jiawang, Lu Chaofeng, Liu Meng, Qin Xiugong, Tian Yunlong, Du Yongjie

Affiliations

School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, PR China.

School of Reliability and Systems Engineering, Beihang University, Beijing 100191, PR China.

Publication

Neural Netw. 2025 Aug;188:107432. doi: 10.1016/j.neunet.2025.107432. Epub 2025 Apr 12.

Abstract

While end-to-end Text-to-Speech (TTS) methods with a limited target speaker corpus can generate high-quality speech, they often require a non-target speaker corpus (auxiliary corpus) containing a substantial number of <text, speech> pairs to train the model, which significantly increases training costs. In this work, we propose a fast, high-quality speech synthesis approach that requires only a few recordings of the target speaker. Based on a statistical analysis of the role of phonemes, function words, and utterance target domains in the corpus, we propose a Statistical-based Compression Auxiliary Corpus algorithm (SCAC). It significantly improves model training speed without a noticeable decrease in speech naturalness. We then use the compressed corpus to train the proposed non-autoregressive model, CMDF-TTS, which uses a multi-level prosody modeling module to capture richer prosodic information and Denoising Diffusion Probabilistic Models (DDPMs) to generate mel-spectrograms. In addition, we fine-tune the model on the target speaker corpus to embed the speaker's characteristics, and apply a Conditional Variational Auto-Encoder Generative Adversarial Network (CVAE-GAN) to further enhance the quality of the synthesized speech. Experimental results on multiple Mandarin and English corpora demonstrate that the CMDF-TTS model, enhanced by the SCAC algorithm, effectively balances training speed and synthesized speech quality. Overall, its performance surpasses that of state-of-the-art models.
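
The abstract describes SCAC only at a high level: auxiliary-corpus utterances are selected according to statistics over phonemes, function words, and target domains. The sketch below is a minimal, hypothetical illustration of that kind of statistics-driven corpus compression, assuming a simple greedy selection by phoneme-rarity coverage; the `compress_corpus` function, the `phonemize` callable, and the `keep_ratio` parameter are illustrative assumptions, not the paper's actual SCAC algorithm.

```python
# Hypothetical sketch of statistics-driven auxiliary-corpus compression.
# The paper's actual SCAC algorithm is not specified in the abstract; this only
# illustrates one plausible greedy selection based on phoneme frequency statistics.
from collections import Counter

def compress_corpus(utterances, phonemize, keep_ratio=0.3):
    """Greedily keep auxiliary utterances that cover rare phonemes first.

    utterances : list of (text, audio_path) pairs from the auxiliary corpus
    phonemize  : callable mapping text -> list of phoneme symbols (assumed)
    keep_ratio : fraction of the auxiliary corpus to retain
    """
    # Global phoneme frequencies over the whole auxiliary corpus.
    freq = Counter(p for text, _ in utterances for p in phonemize(text))

    def rarity_score(text):
        phones = phonemize(text)
        if not phones:
            return 0.0
        # Rarer phonemes contribute more; averaging keeps long utterances from dominating.
        return sum(1.0 / freq[p] for p in phones) / len(phones)

    ranked = sorted(utterances, key=lambda pair: rarity_score(pair[0]), reverse=True)
    return ranked[: int(len(ranked) * keep_ratio)]
```

In this sketch, a smaller `keep_ratio` trades auxiliary-corpus size (and hence training time) against phonetic coverage, which mirrors the speed-versus-naturalness balance the abstract attributes to SCAC.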
