StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models.

Authors

Li Yinghao Aaron, Han Cong, Raghavan Vinay S, Mischler Gavin, Mesgarani Nima

Affiliation

Columbia University.

Publication

Adv Neural Inf Process Syst. 2023 Dec;36:19594-19621. Epub 2023 Dec 10.

PMID: 39866554
Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11759097/

Abstract

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs. The audio demos and source code are available at https://styletts2.github.io/.
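
The abstract mentions differentiable duration modeling for end-to-end training but does not describe the mechanism. As a rough illustration of why differentiability matters here, the sketch below uses Gaussian upsampling, a generic soft-alignment technique swapped in for illustration and not necessarily the method used in StyleTTS 2. Repeating each phoneme feature an integer number of times has no useful gradient with respect to the durations, whereas a soft weighting over phonemes changes smoothly as durations change. The function name `gaussian_upsample` and the `sigma` parameter are hypothetical.

```python
import math


def gaussian_upsample(phoneme_feats, durations, sigma=1.0):
    """Softly expand per-phoneme features into per-frame features.

    Illustrative only: this is generic Gaussian upsampling, not the
    duration model described in the StyleTTS 2 paper.

    phoneme_feats: list of N feature vectors (lists of floats).
    durations: list of N positive, possibly fractional, durations in frames.
    Returns a list of T frame vectors, where T = round(sum(durations)).
    """
    # Centre of each phoneme on the frame axis: its end minus half its length.
    centers, end = [], 0.0
    for d in durations:
        end += d
        centers.append(end - d / 2.0)

    num_frames = max(1, round(end))
    dim = len(phoneme_feats[0])
    frames = []
    for t in range(num_frames):
        # Frame t spans [t, t + 1); compare its midpoint with each centre.
        logits = [-((t + 0.5 - c) ** 2) / (2.0 * sigma ** 2) for c in centers]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        # Each frame is a convex combination of the phoneme features.
        frames.append([
            sum((w / total) * feat[k] for w, feat in zip(weights, phoneme_feats))
            for k in range(dim)
        ])
    return frames
```

Because every frame is a smooth function of the phoneme centres, a small change in a predicted duration produces a small change in the frame features. That smoothness is the property that lets a loss computed on the output, such as a discriminator loss, send gradients back to a duration predictor.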
