Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China.
Microsoft, China.
Neural Netw. 2021 Aug;140:223-236. doi: 10.1016/j.neunet.2021.03.005. Epub 2021 Mar 16.
In this paper, we propose a cycle-consistent-network-based end-to-end TTS approach for speaking style transfer, covering intra-speaker, inter-speaker, and unseen-speaker style transfer in both parallel and non-parallel settings. The proposed approach is built upon a multi-speaker Variational Autoencoder (VAE) TTS model. Such a model is usually trained in a paired manner, meaning the reference speech is fully paired with the output in speaker identity, text, and style. To achieve better quality for style transfer, which in most cases is unpaired, we augment the model with an unpaired path that uses a separate variational style encoder. The unpaired path takes an unpaired reference speech as input and yields an unpaired output. This unpaired output, which lacks a direct ground-truth target, is then constrained by a carefully designed cycle-consistent network. Specifically, the unpaired output of the forward transfer is fed back into the model as an unpaired reference input, and the output of the backward transfer is expected to match the original unpaired reference speech. An ablation study shows the effectiveness of the unpaired path, the separate style encoders, and the cycle-consistent network in the proposed model. The final evaluation demonstrates that the proposed approach significantly outperforms the Global Style Token (GST) and VAE based systems across all six style transfer categories, in metrics of naturalness, speech quality, similarity of speaker identity, and similarity of speaking style.
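The cycle-consistency constraint described above can be sketched numerically. This is a minimal toy illustration, not the paper's implementation: the encoder and decoder here are hypothetical linear stand-ins operating on small vectors, whereas the actual model is a multi-speaker VAE TTS with a variational style encoder operating on mel-spectrograms. All names and dimensions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions; the real model works on spectrogram frames.
D_SPEECH, D_STYLE = 4, 2
W_enc = rng.normal(size=(D_STYLE, D_SPEECH))  # toy style-encoder weights
W_dec = rng.normal(size=(D_SPEECH, D_STYLE))  # toy decoder style projection


def style_encoder(speech):
    """Stand-in for the separate variational style encoder (posterior mean only)."""
    return np.tanh(W_enc @ speech)


def synthesize(text_emb, style):
    """Stand-in for the VAE TTS decoder conditioned on text and style."""
    return text_emb + W_dec @ style


def cycle_consistency_loss(text_emb, ref_text_emb, ref_speech):
    # Forward transfer: synthesize the target text with the style of an
    # unpaired reference speech. This output has no ground-truth target.
    forward_out = synthesize(text_emb, style_encoder(ref_speech))
    # Backward transfer: feed the forward output back in as an unpaired
    # reference and resynthesize the reference's own text; the result is
    # constrained to match the original unpaired reference speech.
    backward_out = synthesize(ref_text_emb, style_encoder(forward_out))
    return float(np.mean((backward_out - ref_speech) ** 2))


# Usage with random toy embeddings in place of real text/speech features.
text = rng.normal(size=D_SPEECH)
ref_text = rng.normal(size=D_SPEECH)
ref_speech = rng.normal(size=D_SPEECH)
loss = cycle_consistency_loss(text, ref_text, ref_speech)
```

In training, this scalar would be minimized alongside the paired reconstruction and VAE losses, so that the backward transfer reconstructs the unpaired reference even though the forward output is never directly supervised.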