StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models.

Authors

Li Yinghao Aaron, Han Cong, Raghavan Vinay S, Mischler Gavin, Mesgarani Nima

Affiliation

Columbia University.

Publication information

Adv Neural Inf Process Syst. 2023 Dec;36:19594-19621. Epub 2023 Dec 10.

Abstract

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs. The audio demos and source code are available at https://styletts2.github.io/.




