Benaroya Laurent, Obin Nicolas, Roebel Axel
Analysis/Synthesis Team-STMS, IRCAM, Sorbonne University, CNRS, French Ministry of Culture, 75004 Paris, France.
Entropy (Basel). 2023 Feb 18;25(2):375. doi: 10.3390/e25020375.
Voice conversion (VC) consists of digitally altering an individual's voice to manipulate part of its content, primarily the speaker's identity, while leaving the rest unchanged. Research in neural VC has achieved considerable breakthroughs, with the capacity to falsify a voice identity from a small amount of data with highly realistic rendering. This paper goes beyond voice identity manipulation and presents an original neural architecture that allows the manipulation of voice attributes (e.g., gender and age). The proposed architecture is inspired by the fader network and transfers the same ideas to voice manipulation. The information conveyed by the speech signal is disentangled into interpretable voice attributes by minimizing an adversarial loss that makes the encoded information mutually independent while preserving the capacity to generate a speech signal from the disentangled codes. During inference for voice conversion, the disentangled voice attributes can be manipulated and the speech signal generated accordingly. For experimental evaluation, the proposed method is applied to the task of voice gender conversion using the freely available VCTK dataset. Quantitative measurements of mutual information between the variables of speaker identity and speaker gender show that the proposed architecture can learn a gender-independent representation of speakers. Additional speaker recognition measurements indicate that speaker identity can be recognized accurately from the gender-independent representation. Finally, a subjective experiment on the task of voice gender manipulation shows that the proposed architecture can convert voice gender with very high efficiency and good naturalness.
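The adversarial disentanglement described above can be illustrated with the two competing objectives of a fader-style model. The following is a minimal sketch, not the authors' implementation: it assumes a binary attribute (such as the gender label used in the experiments), uses plain Python scalars and lists in place of network outputs, and all function names and the weight `lam` are hypothetical. The discriminator is trained to predict the attribute from the latent code, while the autoencoder is trained to reconstruct the signal and to push the discriminator toward the flipped label, which is the form of the objective in the original fader network for images.

```python
import math


def binary_cross_entropy(p, y, eps=1e-12):
    """Cross-entropy between a predicted probability p and a binary label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def mean_squared_error(x, x_hat):
    """Reconstruction error between a signal frame and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def discriminator_loss(p_attr, y):
    """Discriminator objective: predict the true attribute y from the code.

    p_attr is the discriminator's estimated probability that the attribute
    equals 1, computed from the latent code (assumed given here).
    """
    return binary_cross_entropy(p_attr, y)


def autoencoder_loss(x, x_hat, p_attr, y, lam=1.0):
    """Autoencoder objective: reconstruct x and fool the discriminator.

    The adversarial term is small when the discriminator assigns high
    probability to the flipped label (1 - y), i.e. when the code carries
    little information about the attribute. lam is a hypothetical weight.
    """
    return mean_squared_error(x, x_hat) + lam * binary_cross_entropy(p_attr, 1 - y)


# Toy values standing in for network outputs (assumptions for illustration).
frame = [0.2, -0.1, 0.4]
reconstruction = [0.2, -0.1, 0.4]  # perfect reconstruction
loss_uninformative = autoencoder_loss(frame, reconstruction, 0.5, 1)
loss_leaky = autoencoder_loss(frame, reconstruction, 0.9, 1)
```

With a perfect reconstruction, the autoencoder loss reduces to the adversarial term: it is lower when the discriminator is at chance (`p_attr = 0.5`) than when it confidently recovers the true attribute (`p_attr = 0.9`). At inference, a decoder trained this way would receive the attribute-independent code together with a freely chosen attribute value, which is what makes the attribute manipulable.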