StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models.

Authors

Li Yinghao Aaron, Han Cong, Raghavan Vinay S, Mischler Gavin, Mesgarani Nima

Affiliation

Columbia University.

Publication information

Adv Neural Inf Process Syst. 2023 Dec;36:19594-19621. Epub 2023 Dec 10.

PMID: 39866554

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11759097/

Abstract

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs. The audio demos and source code are available at https://styletts2.github.io/.
