Zhao Yunxin, Kuruvilla-Dugdale Mili, Song Minguang
Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211 USA.
Department of Communication Science and Disorders, University of Missouri, Columbia, MO 65211 USA.
IEEE/ACM Trans Audio Speech Lang Process. 2018 Dec;26(12):2267-2276. doi: 10.1109/TASLP.2018.2860682. Epub 2018 Jul 27.
We investigate a structured sparse spectral transform method for voice conversion (VC) to perform frequency warping and spectral shaping simultaneously on high-dimensional (D) STRAIGHT spectra. Learning a large transform matrix for high-D data often results in an overfit matrix with low sparsity, which leads to muffled speech in VC. We address this problem by using the frequency-warping characteristic of a source-target speaker pair to define a region of support (ROS) in a transform matrix, and further optimize the matrix by nonnegative matrix factorization (NMF) to obtain a structured sparse transform. We also investigate structural measures of spectral and temporal covariance and variance at different scales for assessing VC speech quality. Our experiments on the ARCTIC dataset with 12 speaker pairs show that embedding the ROS in spectral transforms offers flexibility in tradeoffs between spectral distortion and structure preservation, and the structural measures provide quantitatively reasonable results on converted speech. Our subjective listening tests show that the proposed VC method achieves a mean opinion score of "very good" relative to natural speech, and in comparison with three other VC methods, it is the most preferred one in naturalness and in voice similarity to target speakers.
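The idea of confining a nonnegative transform to a region of support can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the ROS shape, the objective, or the update rule, so the band-shaped mask, the `half_width` parameter, the Frobenius-norm objective, the Lee-Seung-style multiplicative update, and the function names `ros_mask` and `fit_masked_transform` are all assumptions made for illustration.

```python
import numpy as np


def ros_mask(n_bins, warp, half_width):
    """Boolean (n_bins, n_bins) mask, assumed band-shaped for illustration.

    Row i (a target frequency bin) may draw only on source bins within
    half_width of warp[i], the source bin that the warping maps to bin i.
    """
    cols = np.arange(n_bins)
    return np.abs(cols[None, :] - warp[:, None]) <= half_width


def fit_masked_transform(X, Y, mask, n_iter=200, eps=1e-12):
    """Fit a nonnegative W, zero outside mask, so that W @ X approximates Y.

    X, Y: nonnegative, time-aligned spectra of shape (n_bins, n_frames).
    Uses multiplicative updates for the squared Frobenius error with X fixed
    (an assumed objective; the source abstract does not specify one).
    """
    # Start from local averaging inside the region of support.
    W = mask.astype(float)
    W /= W.sum(axis=1, keepdims=True)

    YXt = Y @ X.T
    XXt = X @ X.T
    for _ in range(n_iter):
        # Entries that start at zero stay at zero, so the update
        # never leaves the region of support.
        W *= YXt / (W @ XXt + eps)
    return W
```

Because the update is multiplicative, sparsity is enforced by the initialization alone: widening `half_width` allows more spectral shaping at the cost of a denser matrix, while narrowing it pushes the transform toward pure frequency warping.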