DIA-TTS: Deep-Inherited Attention-Based Text-to-Speech Synthesizer.

Authors

Yu Junxiao, Xu Zhengyuan, He Xu, Wang Jian, Liu Bin, Feng Rui, Zhu Songsheng, Wang Wei, Li Jianqing

Affiliations

Jiangsu Province Engineering Research Center of Smart Wearable and Rehabilitation Devices, School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing 211166, China.

Department of Medical Engineering, Wannan Medical College, Wuhu 241002, China.

Publication information

Entropy (Basel). 2022 Dec 26;25(1):41. doi: 10.3390/e25010041.

DOI:10.3390/e25010041
PMID:36673182
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9857677/
Abstract

Text-to-speech (TTS) synthesizers are widely used as assistive tools in various fields. Traditional sequence-to-sequence (seq2seq) TTS models such as Tacotron2 use a single soft attention mechanism to align the encoder and decoder. This is their biggest shortcoming: on long sentences they can generate words incorrectly or repeatedly. They may also produce run-on sentences and misplaced breaks regardless of punctuation marks, which makes the synthesized waveform lack emotion and sound unnatural. In this paper, we propose an end-to-end neural generative TTS model based on a deep-inherited attention (DIA) mechanism with an adjustable local-sensitive factor (LSF). The inheritance mechanism allows multiple iterations of the DIA that share the same training parameters, which tightens the token-frame correlation and speeds up the alignment process. The LSF enhances the context connection by expanding the region on which the DIA concentrates. In addition, a multi-RNN block is used in the decoder for better acoustic feature extraction and generation, and hidden-state information derived from the multi-RNN layers is used for attention alignment. Together, the DIA and the multi-RNN layers improve the prediction of phrase breaks in the synthesized speech. We used WaveGlow as the vocoder for real-time, human-like audio synthesis. Human subjective experiments show that DIA-TTS achieved a mean opinion score (MOS) of 4.48 for naturalness. Ablation studies further support the advantage of the DIA mechanism in improving phrase breaks and attention robustness.
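
The abstract does not give the DIA equations, so the following is only an illustrative sketch of the general idea it describes, not the authors' implementation. It assumes a simplified dot-product energy, models the "inheritance" as feeding each iteration's alignment into the next while reusing one set of weights, and models the LSF as a moving-average window over the previous alignment. The names `inherited_attention`, `widen`, `scale`, `w_loc`, `n_iters`, and `lsf` are all hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def widen(align, lsf):
    """Assumed form of the LSF: average each weight over a window of `lsf` tokens.

    A larger window spreads the previous focus across more neighbouring tokens.
    """
    half = lsf // 2
    out = []
    for t in range(len(align)):
        lo, hi = max(0, t - half), min(len(align), t + half + 1)
        out.append(sum(align[lo:hi]) / (hi - lo))
    return out

def inherited_attention(query, memory, prev_align,
                        scale=1.0, w_loc=2.0, n_iters=2, lsf=3):
    """Refine an alignment over encoder tokens `n_iters` times.

    query      : decoder state, a list of floats
    memory     : one encoder vector per token, same length as `query`
    prev_align : alignment from the previous decoder step
    scale, w_loc are the only weights and are shared by every iteration
    (hypothetical stand-ins for the shared training parameters).
    """
    align = list(prev_align)
    for _ in range(n_iters):
        loc = widen(align, lsf)  # location feature from the inherited alignment
        energies = [
            scale * sum(q * m for q, m in zip(query, token)) + w_loc * loc[t]
            for t, token in enumerate(memory)
        ]
        align = softmax(energies)  # passed on to the next iteration
    dim = len(memory[0])
    context = [sum(a * tok[d] for a, tok in zip(align, memory)) for d in range(dim)]
    return align, context

if __name__ == "__main__":
    memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    align, context = inherited_attention([0.2, 0.9], memory, [1.0, 0.0, 0.0, 0.0])
    print([round(a, 3) for a in align], [round(c, 3) for c in context])
```

In this sketch, raising `n_iters` re-scores the tokens against an alignment that already reflects the current query, and raising `lsf` lets the location term reward tokens farther from the previous focus. Whether either behaviour matches the paper's mechanism would need to be checked against the full text.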

