Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China.
Microsoft, China.
Neural Netw. 2021 Aug;140:223-236. doi: 10.1016/j.neunet.2021.03.005. Epub 2021 Mar 16.
In this paper, we propose a cycle-consistent-network-based end-to-end TTS approach for speaking style transfer, covering intra-speaker, inter-speaker, and unseen-speaker style transfer in both parallel and non-parallel settings. The proposed approach is built upon a multi-speaker Variational Autoencoder (VAE) TTS model. Such a model is usually trained in a paired manner, meaning the reference speech is fully paired with the output in speaker identity, text, and style. To achieve better quality for style transfer, which in most cases is unpaired, we augment the model with an unpaired path that uses a separate variational style encoder. The unpaired path takes an unpaired reference speech as input and yields an unpaired output. This unpaired output, which lacks a direct ground-truth target, is then constrained by a carefully designed cycle-consistent network. Specifically, the unpaired output of the forward transfer is fed back into the model as an unpaired reference input, and the output of the backward transfer is expected to match the original unpaired reference speech. An ablation study shows the effectiveness of the unpaired path, the separate style encoders, and the cycle-consistent network in the proposed model. The final evaluation demonstrates that the proposed approach significantly outperforms the Global Style Token (GST) and VAE based systems across all six style transfer categories, in metrics of naturalness, speech quality, similarity of speaker identity, and similarity of speaking style.
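The cycle-consistency constraint described above can be sketched numerically. This is a minimal toy illustration, not the paper's implementation: the encoder and decoder here are hypothetical linear stand-ins operating on small vectors, whereas the actual model is a multi-speaker VAE TTS with a variational style encoder operating on mel-spectrograms. All names and dimensions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions; the real model works on spectrogram frames.
D_SPEECH, D_STYLE = 4, 2
W_enc = rng.normal(size=(D_STYLE, D_SPEECH))  # toy style-encoder weights
W_dec = rng.normal(size=(D_SPEECH, D_STYLE))  # toy decoder style projection


def style_encoder(speech):
    """Stand-in for the separate variational style encoder (posterior mean only)."""
    return np.tanh(W_enc @ speech)


def synthesize(text_emb, style):
    """Stand-in for the VAE TTS decoder conditioned on text and style."""
    return text_emb + W_dec @ style


def cycle_consistency_loss(text_emb, ref_text_emb, ref_speech):
    # Forward transfer: synthesize the target text with the style of an
    # unpaired reference speech. This output has no ground-truth target.
    forward_out = synthesize(text_emb, style_encoder(ref_speech))
    # Backward transfer: feed the forward output back in as an unpaired
    # reference and resynthesize the reference's own text; the result is
    # constrained to match the original unpaired reference speech.
    backward_out = synthesize(ref_text_emb, style_encoder(forward_out))
    return float(np.mean((backward_out - ref_speech) ** 2))


# Usage with random toy embeddings in place of real text/speech features.
text = rng.normal(size=D_SPEECH)
ref_text = rng.normal(size=D_SPEECH)
ref_speech = rng.normal(size=D_SPEECH)
loss = cycle_consistency_loss(text, ref_text, ref_speech)
```

In training, this scalar would be minimized alongside the paired reconstruction and VAE losses, so that the backward transfer reconstructs the unpaired reference even though the forward output is never directly supervised.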