Suppr 超能文献



Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations.

Authors

Laurent Benaroya, Nicolas Obin, Axel Roebel

Affiliation

Analysis/Synthesis Team-STMS, IRCAM, Sorbonne University, CNRS, French Ministry of Culture, 75004 Paris, France.

Publication

Entropy (Basel). 2023 Feb 18;25(2):375. doi: 10.3390/e25020375.

DOI:10.3390/e25020375
PMID:36832741
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9955323/
Abstract

Voice conversion (VC) consists of digitally altering the voice of an individual to manipulate part of its content, primarily its identity, while maintaining the rest unchanged. Research in neural VC has accomplished considerable breakthroughs with the capacity to falsify a voice identity using a small amount of data with a highly realistic rendering. This paper goes beyond voice identity manipulation and presents an original neural architecture that allows the manipulation of voice attributes (e.g., gender and age). The proposed architecture is inspired by the fader network, transferring the same ideas to voice manipulation. The information conveyed by the speech signal is disentangled into interpretative voice attributes by means of minimizing adversarial loss to make the encoded information mutually independent while preserving the capacity to generate a speech signal from the disentangled codes. During inference for voice conversion, the disentangled voice attributes can be manipulated and the speech signal can be generated accordingly. For experimental evaluation, the proposed method is applied to the task of voice gender conversion using the freely available VCTK dataset. Quantitative measurements of mutual information between the variables of speaker identity and speaker gender show that the proposed architecture can learn gender-independent representation of speakers. Additional measurements of speaker recognition indicate that speaker identity can be recognized accurately from the gender-independent representation. Finally, a subjective experiment conducted on the task of voice gender manipulation shows that the proposed architecture can convert voice gender with very high efficiency and good naturalness.
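The fader-network idea the abstract describes — a decoder that reconstructs speech from a latent code plus an explicit attribute, while an adversarial classifier tries to recover that attribute from the latent code — can be sketched with a toy forward pass. Everything below (linear stand-ins for the networks, the dimensions, the loss weighting `lam`) is an illustrative assumption, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions, assumed for illustration only
D_IN, D_Z, D_ATTR = 80, 16, 2   # spectral frame, latent code, attribute one-hot (e.g. gender)

# Linear stand-ins for the encoder, decoder, and adversarial attribute classifier
W_enc = rng.normal(scale=0.1, size=(D_IN, D_Z))
W_dec = rng.normal(scale=0.1, size=(D_Z + D_ATTR, D_IN))
W_adv = rng.normal(scale=0.1, size=(D_Z, D_ATTR))

def softmax(v):
    e = np.exp(v - v.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fader_losses(x, attr_onehot, lam=1.0):
    """One forward pass of a fader-style objective: the decoder must
    reconstruct x from (z, attribute), the classifier tries to read the
    attribute back out of z, and the encoder would be trained to minimise
    l_rec - lam * l_cls, pushing z toward attribute independence."""
    z = x @ W_enc                                          # encode
    x_hat = np.concatenate([z, attr_onehot], axis=-1) @ W_dec  # decode with attribute
    l_rec = np.mean((x - x_hat) ** 2)                      # reconstruction loss
    p = softmax(z @ W_adv)                                 # classifier's attribute guess
    l_cls = -np.mean(np.sum(attr_onehot * np.log(p + 1e-9), axis=-1))  # cross-entropy
    return l_rec - lam * l_cls, l_rec, l_cls

x = rng.normal(size=(4, D_IN))                # a batch of 4 toy frames
a = np.eye(D_ATTR)[[0, 1, 0, 1]]              # alternating attribute labels
enc_loss, l_rec, l_cls = fader_losses(x, a)
```

At inference time the attribute vector fed to the decoder is simply swapped (e.g. the opposite gender one-hot), which is what makes the manipulation possible once z no longer carries the attribute.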

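The quantitative evaluation rests on mutual information between speaker identity and speaker gender in the learned representation. As a reminder of the underlying measure (this is a plain plug-in estimator over discrete labels, not the paper's estimator), I(X;Y) can be computed from paired samples as:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples,
    using empirical joint and marginal distributions."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        pxy = c / n
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) )
        mi += pxy * math.log2(pxy * n * n / (px[x] * py[y]))
    return mi

# Gender fully determined by speaker: MI equals H(gender) = 1 bit here
speakers = ["s1", "s1", "s2", "s2"]
genders  = ["m",  "m",  "f",  "f"]
dependent_mi = mutual_information(speakers, genders)    # 1.0

# Empirically independent labels: MI is exactly 0
codes = ["a", "b", "a", "b"]
independent_mi = mutual_information(speakers, codes)    # 0.0
```

A gender-independent speaker representation, as the paper claims to learn, corresponds to the second case: MI between the gender variable and the speaker code close to zero, while speaker identity remains recoverable from the code.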

Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/678f/9955323/d3e587c52233/entropy-25-00375-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/678f/9955323/e7796920a43c/entropy-25-00375-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/678f/9955323/4fcec41fc7a4/entropy-25-00375-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/678f/9955323/45a4ff2f9ff8/entropy-25-00375-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/678f/9955323/390aa3249728/entropy-25-00375-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/678f/9955323/5bb556c0cd01/entropy-25-00375-g006.jpg

Similar Articles

1
Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations.
Entropy (Basel). 2023 Feb 18;25(2):375. doi: 10.3390/e25020375.
2
StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models.
SLT Workshop Spok Lang Technol. 2023 Jan;2022:920-927. doi: 10.1109/slt54892.2023.10022498.
3
GLGAN-VC: A Guided Loss-Based Generative Adversarial Network for Many-to-Many Voice Conversion.
IEEE Trans Neural Netw Learn Syst. 2025 Jan;36(1):1813-1826. doi: 10.1109/TNNLS.2023.3335119. Epub 2025 Jan 7.
4
Noise-robust voice conversion with domain adversarial training.
Neural Netw. 2022 Apr;148:74-84. doi: 10.1016/j.neunet.2022.01.003. Epub 2022 Jan 13.
5
A Multidomain Generative Adversarial Network for Hoarse-to-Normal Voice Conversion.
J Voice. 2023 Oct 14. doi: 10.1016/j.jvoice.2023.08.027.
6
A Step Towards Preserving Speakers' Identity While Detecting Depression Via Speaker Disentanglement.
Interspeech. 2022 Sep;2022:3338-3342. doi: 10.21437/interspeech.2022-10798.
7
Deep Realistic Facial Editing via Label-restricted Mask Disentanglement.
Comput Intell Neurosci. 2022 Nov 23;2022:5652730. doi: 10.1155/2022/5652730. eCollection 2022.
8
Disentangled Representation Learning for Multiple Attributes Preserving Face Deidentification.
IEEE Trans Neural Netw Learn Syst. 2022 Jan;33(1):244-256. doi: 10.1109/TNNLS.2020.3027617. Epub 2022 Jan 5.
9
Influence of emotional prosody, content, and repetition on memory recognition of speaker identity.
Q J Exp Psychol (Hove). 2021 Jul;74(7):1185-1201. doi: 10.1177/1747021821998557. Epub 2021 Mar 17.
10
Familiarity and Voice Representation: From Acoustic-Based Representation to Voice Averages.
Front Psychol. 2017 Jul 14;8:1180. doi: 10.3389/fpsyg.2017.01180. eCollection 2017.
