Information Engineering University, Zhengzhou 450001, China.
Handan Vocational College of Science and Technology, Handan 056000, China.
Comput Intell Neurosci. 2022 Oct 11;2022:6707304. doi: 10.1155/2022/6707304. eCollection 2022.
Voice cloning is a specialized Text-to-Speech (TTS) service that adapts a source TTS model to synthesize a personal voice from only a few speech samples of the target speaker. Although a Tacotron 2-based multi-speaker TTS system can implement voice cloning by introducing a d-vector into the speaker encoder, the speaker characteristics described by the d-vector cannot capture the voice information of the entire utterance, which limits the similarity of the cloned voice. Moreover, WaveNet, used as the vocoder, sacrifices speech generation speed. To balance model parameters, inference speed, and voice quality, this paper proposes a voice cloning method based on an improved HiFi-GAN. (1) To improve the feature representation ability of the speaker encoder, the x-vector is used as the embedding vector characterizing the target speaker. (2) To improve the performance of the HiFi-GAN vocoder, the input Mel spectrum is processed by a competitive multiscale convolution strategy. (3) One-dimensional depth-wise separable convolutions replace all standard one-dimensional convolutions, significantly reducing the model parameters and increasing the inference speed. The improved HiFi-GAN model reduces the number of vocoder parameters by about 68.58% and boosts inference speed: on the GPU and CPU, inference speed increases by 11.84% and 30.99%, respectively. Voice quality also improves marginally, with MOS increasing by 0.13 and PESQ by 0.11. The improved HiFi-GAN model exhibits outstanding performance and compatibility in the voice cloning task; combined with the x-vector embedding, the proposed model achieves the highest score across all models and test sets.
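The parameter savings claimed for step (3) come from a standard counting argument: a depth-wise separable convolution factors one dense convolution into a per-channel (depthwise) filter followed by a 1×1 (pointwise) channel mixer. A minimal sketch of the count is below; the channel and kernel sizes are illustrative assumptions, not the paper's actual HiFi-GAN configuration:

```python
def conv1d_params(c_in, c_out, k, bias=True):
    """Parameters of a standard 1-D convolution:
    one k-wide filter per (input channel, output channel) pair,
    plus an optional bias per output channel."""
    return c_in * c_out * k + (c_out if bias else 0)

def depthwise_separable_params(c_in, c_out, k, bias=True):
    """Depthwise step: one k-wide filter per input channel.
    Pointwise step: a 1x1 convolution that mixes channels."""
    depthwise = c_in * k + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise

# Illustrative sizes (assumed): 512 channels in/out, kernel width 7.
std = conv1d_params(512, 512, 7)
sep = depthwise_separable_params(512, 512, 7)
print(f"standard: {std}, separable: {sep}, saved: {1 - sep / std:.2%}")
```

For these assumed sizes the separable form needs roughly 15% of the standard parameters; the paper's overall 68.58% reduction is smaller because only the convolutional layers are factored and the rest of the vocoder is unchanged.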