
DIA-TTS: Deep-Inherited Attention-Based Text-to-Speech Synthesizer.

Authors

Yu Junxiao, Xu Zhengyuan, He Xu, Wang Jian, Liu Bin, Feng Rui, Zhu Songsheng, Wang Wei, Li Jianqing

Affiliations

Jiangsu Province Engineering Research Center of Smart Wearable and Rehabilitation Devices, School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing 211166, China.

Department of Medical Engineering, Wannan Medical College, Wuhu 241002, China.

Publication

Entropy (Basel). 2022 Dec 26;25(1):41. doi: 10.3390/e25010041.

Abstract

Text-to-speech (TTS) synthesizers are widely used as vital assistive tools in various fields. Traditional sequence-to-sequence (seq2seq) TTS models such as Tacotron2 use a single soft attention mechanism for encoder-decoder alignment; this is their main shortcoming, as it incorrectly or repeatedly generates words when dealing with long sentences. It may also produce run-on sentences with incorrect breaks regardless of punctuation, causing the synthesized waveform to lack emotion and sound unnatural. In this paper, we propose an end-to-end neural generative TTS model based on a deep-inherited attention (DIA) mechanism with an adjustable local-sensitive factor (LSF). The inheritance mechanism allows multiple iterations of the DIA that share the same training parameters, which tightens the token-frame correlation and accelerates the alignment process. In addition, the LSF enhances context connection by expanding the DIA concentration region. A multi-RNN block is used in the decoder for better acoustic feature extraction and generation, and hidden-state information derived from the multi-RNN layers is utilized for attention alignment. The collaborative work of the DIA and multi-RNN layers yields high-quality prediction of phrase breaks in the synthesized speech. We used WaveGlow as a vocoder for real-time, human-like audio synthesis. Human subjective experiments show that DIA-TTS achieved a mean opinion score (MOS) of 4.48 in terms of naturalness. Ablation studies further demonstrate the superiority of the DIA mechanism in enhancing phrase breaks and attention robustness.
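The abstract describes two ideas that can be sketched concretely: iterating the attention step several times with shared parameters (inheritance), and restricting attention to a local window around the previous alignment peak (the LSF). The following is a minimal, hedged sketch of such a step; the function name `dia_step`, the simple dot-product scoring, and the hard windowing are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax; masked (-inf) entries get weight 0.
    e = np.exp(x - x.max())
    return e / e.sum()

def dia_step(query, keys, prev_align, lsf, n_inherit=2):
    """Hypothetical sketch of one deep-inherited, locally windowed attention step.

    query:      (d,) decoder hidden state
    keys:       (T, d) encoder outputs (one row per input token)
    prev_align: (T,) alignment weights from the previous decoder step
    lsf:        half-width, in tokens, of the local-sensitive window
    n_inherit:  DIA iterations reusing the same scoring parameters
    """
    T = keys.shape[0]
    align = prev_align
    for _ in range(n_inherit):
        # Inheritance: the same scoring parameters (here, a plain dot
        # product) are reused on every iteration, refining the alignment.
        scores = keys @ query
        # LSF: mask scores outside a window centered on the current peak.
        center = int(np.argmax(align))
        mask = np.full(T, -np.inf)
        lo, hi = max(0, center - lsf), min(T, center + lsf + 1)
        mask[lo:hi] = 0.0
        align = softmax(scores + mask)
    context = align @ keys  # attention-weighted sum of encoder outputs
    return context, align
```

Iterating with shared parameters adds no new weights, so the alignment is sharpened without growing the model, and the window keeps attention from jumping far from the previous token, which is the failure mode on long sentences that the abstract highlights.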

