Structured Sparse Spectral Transforms and Structural Measures for Voice Conversion.

Authors

Zhao Yunxin, Kuruvilla-Dugdale Mili, Song Minguang

Affiliations

Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211 USA.

Department of Communication Science and Disorders, University of Missouri, Columbia, MO 65211 USA.

Publication information

IEEE/ACM Trans Audio Speech Lang Process. 2018 Dec;26(12):2267-2276. doi: 10.1109/TASLP.2018.2860682. Epub 2018 Jul 27.

DOI:10.1109/TASLP.2018.2860682

PMID:31984214

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6980218/

Abstract

We investigate a structured sparse spectral transform method for voice conversion (VC) that performs frequency warping and spectral shaping simultaneously on high-dimensional (D) STRAIGHT spectra. Learning a large transform matrix for high-D data often yields an overfit matrix with low sparsity, which leads to muffled speech in VC. We address this problem by using the frequency-warping characteristic of a source-target speaker pair to define a region of support (ROS) in the transform matrix, and we further optimize the matrix by nonnegative matrix factorization (NMF) to obtain a structured sparse transform. We also investigate structural measures of spectral and temporal covariance and variance at different scales for assessing VC speech quality. Our experiments on the ARCTIC dataset with 12 speaker pairs show that embedding the ROS in spectral transforms offers flexibility in the tradeoff between spectral distortion and structure preservation, and that the structural measures provide quantitatively reasonable results on converted speech. Our subjective listening tests show that the proposed VC method achieves a mean opinion score of "very good" relative to natural speech and that, in comparison with three other VC methods, it is the most preferred in naturalness and in voice similarity to target speakers.
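The idea of restricting a nonnegative transform matrix to a region of support can be illustrated with a small sketch. The abstract does not give the paper's actual formulation, so everything below is an assumption made for illustration: the helper names `band_ros_mask` and `fit_masked_transform`, the linear warping function, the band half-width, the synthetic data, and the use of a Lee-Seung style multiplicative update with the source spectra held fixed. The sketch relies on the fact that multiplicative updates keep zero entries at zero, so a transform initialized inside the ROS stays inside it.

```python
import numpy as np

def band_ros_mask(dim, warp, half_width):
    # Hypothetical helper: row i (a target frequency bin) may only draw on
    # source bins within half_width of the warped position warp(i).
    rows = np.arange(dim)[:, None]
    cols = np.arange(dim)[None, :]
    centers = np.clip(warp(rows), 0, dim - 1)
    return np.abs(cols - centers) <= half_width

def fit_masked_transform(X, Y, mask, n_iter=300, eps=1e-12, seed=0):
    # Hypothetical helper: approximately minimize ||Y - W X||_F^2 over W >= 0
    # with support inside mask. X and Y are nonnegative, time-aligned
    # (dim x frames) spectra; only W is updated.
    rng = np.random.default_rng(seed)
    W = rng.uniform(0.1, 1.0, size=mask.shape) * mask  # zero outside the ROS
    XXt = X @ X.T
    YXt = Y @ X.T
    for _ in range(n_iter):
        # Multiplicative update: entries that start at zero remain zero.
        W *= YXt / (W @ XXt + eps)
    return W

# Synthetic demonstration (assumed data, not taken from the paper).
rng = np.random.default_rng(1)
dim, frames = 32, 500
mask = band_ros_mask(dim, lambda i: 1.1 * i, half_width=2)
W_true = rng.uniform(0.0, 1.0, size=(dim, dim)) * mask
X = rng.uniform(0.1, 1.0, size=(dim, frames))
Y = W_true @ X

W = fit_masked_transform(X, Y, mask)
rel_err = np.linalg.norm(Y - W @ X) / np.linalg.norm(Y)
print("relative fit error:", rel_err)
print("fraction of nonzero entries:", np.count_nonzero(W) / W.size)
```

In this toy setting, widening `half_width` gives the transform more freedom to reduce spectral distortion, while narrowing it enforces a sparser, more warping-like structure, which mirrors the tradeoff the abstract describes.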
