IEEE J Biomed Health Inform. 2023 May;27(5):2489-2500. doi: 10.1109/JBHI.2023.3239551. Epub 2023 May 4.
In recent years, a growing number of people have suffered from voice-related diseases. Current pathological speech conversion methods are limited in that each method can convert only a single kind of pathological voice. In this study, we propose a novel Encoder-Decoder Generative Adversarial Network (E-DGAN) that generates personalized speech for pathological-to-normal voice conversion and is suitable for multiple kinds of pathological voices. The proposed method also improves the intelligibility of pathological voices and personalizes the converted speech. Feature extraction is performed using a mel filter bank. The conversion network has an encoder-decoder structure and converts the mel spectrogram of a pathological voice into the mel spectrogram of a normal voice. After conversion by the residual conversion network, the personalized normal speech is synthesized by a neural vocoder. In addition, we propose a subjective evaluation metric named "content similarity" to evaluate the consistency between the content of the converted pathological voice and the reference content. The Saarbrücken Voice Database (SVD) is used to verify the proposed method. The intelligibility and content similarity of pathological voices increase by 18.67% and 2.60%, respectively. An intuitive spectrogram-based analysis also shows a significant improvement. The results show that the proposed method can improve the intelligibility of pathological voices and personalize the conversion of pathological voices into the normal voices of 20 different speakers. Compared with five other pathological voice conversion methods, the proposed method achieves the best evaluation results.
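The abstract does not give the parameters of the mel filter bank front end, so the following is only a minimal NumPy sketch of how a log-mel spectrogram of the kind fed to such a conversion network might be computed. The function names, the sampling rate, FFT size, hop length, and number of mel bands are all illustrative assumptions, not values reported by the authors, and the HTK-style mel formula is one common choice among several.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (an assumed choice; the paper does not specify one)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filter_bank(sr, n_fft, n_mels):
    """Triangular filters spaced evenly on the mel scale from 0 Hz to sr/2."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - lo) / (ctr - lo)    # slope up to the center
        falling = (hi - fft_freqs) / (hi - ctr)   # slope down from the center
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=128, n_mels=80):
    """Return a (frames, n_mels) log-mel spectrogram; all defaults are assumptions."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[t * hop : t * hop + n_fft] * window for t in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (frames, n_fft//2 + 1)
    mel = power @ mel_filter_bank(sr, n_fft, n_mels).T  # (frames, n_mels)
    return np.log(mel + 1e-10)                          # small offset avoids log(0)

if __name__ == "__main__":
    # One second of a synthetic 440 Hz tone stands in for a voice recording.
    t = np.arange(16000) / 16000.0
    spec = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))
    print(spec.shape)
```

In a pipeline like the one described, a matrix of this form would be the input to the encoder-decoder conversion network, and the converted mel spectrogram would then be passed to the neural vocoder for waveform synthesis.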