Li Yinghao Aaron, Han Cong, Raghavan Vinay S, Mischler Gavin, Mesgarani Nima
Columbia University.
Adv Neural Inf Process Syst. 2023 Dec;36:19594-19621. Epub 2023 Dec 10.
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models, generating the most suitable style for the text without requiring reference speech and achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches them on the multispeaker VCTK dataset, as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models in zero-shot speaker adaptation. This work achieves the first human-level TTS on both single-speaker and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs. The audio demos and source code are available at https://styletts2.github.io/.
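To make the style-diffusion idea concrete, the sketch below samples a style vector from pure noise conditioned on a text embedding, so no reference speech is needed at inference. It uses a plain DDPM ancestral sampler as a stand-in for the paper's actual diffusion formulation and sampler; StyleDenoiser, style_dim, text_dim, and n_steps are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch (assumed names and shapes): sample a latent style vector
# conditioned on text via a denoising diffusion loop.
import torch
import torch.nn as nn

class StyleDenoiser(nn.Module):
    """Hypothetical noise predictor eps_theta(s_t, t, text_emb)."""
    def __init__(self, style_dim=128, text_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(style_dim + text_dim + 1, 512),
            nn.SiLU(),
            nn.Linear(512, style_dim),
        )

    def forward(self, s_t, t, text_emb):
        # Condition on the noisy style, a normalized timestep, and the text.
        t_feat = t.float().unsqueeze(-1) / 1000.0
        return self.net(torch.cat([s_t, t_feat, text_emb], dim=-1))

@torch.no_grad()
def sample_style(denoiser, text_emb, style_dim=128, n_steps=1000):
    """Sample a style vector from noise, conditioned on the text embedding."""
    betas = torch.linspace(1e-4, 0.02, n_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    s = torch.randn(text_emb.size(0), style_dim)  # s_T ~ N(0, I)
    for t in reversed(range(n_steps)):
        t_batch = torch.full((text_emb.size(0),), t)
        eps = denoiser(s, t_batch, text_emb)
        # Standard DDPM posterior mean for s_{t-1} given s_t.
        mean = (s - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) \
               / torch.sqrt(alphas[t])
        noise = torch.randn_like(s) if t > 0 else torch.zeros_like(s)
        s = mean + torch.sqrt(betas[t]) * noise
    return s  # resampling yields a different but still suitable style
```

Because the style is a single low-dimensional latent vector rather than a spectrogram, this latent diffusion is cheap: the loop iterates over a 128-dimensional vector instead of audio frames, while still giving the sample diversity the abstract attributes to diffusion models.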
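The SLM adversarial training can be sketched in the same spirit: a frozen pre-trained WavLM encodes real and synthesized waveforms, and a small trainable head scores the resulting features. The head architecture and the least-squares GAN losses below are common illustrative choices, not necessarily the paper's exact objective; only the use of WavLM as a frozen discriminator backbone comes from the abstract.

```python
# Minimal sketch (assumed head and losses): WavLM features as the basis of an
# adversarial discriminator for TTS output.
import torch
import torch.nn as nn
from transformers import WavLMModel

class SLMDiscriminator(nn.Module):
    def __init__(self, wavlm_name="microsoft/wavlm-base-plus"):
        super().__init__()
        self.wavlm = WavLMModel.from_pretrained(wavlm_name)
        self.wavlm.requires_grad_(False)  # the SLM itself stays frozen
        self.head = nn.Sequential(
            nn.Linear(self.wavlm.config.hidden_size, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
        )

    def forward(self, wav):  # wav: (batch, samples), 16 kHz mono
        feats = self.wavlm(wav).last_hidden_state  # (batch, frames, hidden)
        return self.head(feats).squeeze(-1)        # per-frame real/fake logits

def discriminator_loss(disc, real_wav, fake_wav):
    # Push real audio toward 1 and generated audio toward 0 (LSGAN-style).
    return ((disc(real_wav) - 1) ** 2).mean() + (disc(fake_wav.detach()) ** 2).mean()

def generator_loss(disc, fake_wav):
    # The TTS model is rewarded when WavLM features of its output look real.
    return ((disc(fake_wav) - 1) ** 2).mean()
```

Freezing the WavLM parameters does not block gradients from flowing through it to the generator's waveform, which is how the SLM's learned speech representations shape the synthesized audio; the differentiable duration modeling mentioned in the abstract is what makes this end-to-end gradient path possible.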