Hsu Wen-Shin, Lin Guang-Tao, Wang Wei-Hsun
Department of Medical Information, Chung Shan Medical University, Taichung 402201, Taiwan.
Informatics Office Technology, Chung Shan Medical University Hospital, Taichung 402201, Taiwan.
Diagnostics (Basel). 2024 Nov 29;14(23):2693. doi: 10.3390/diagnostics14232693.
Dysarthria, a motor speech disorder caused by neurological damage, significantly hampers speech intelligibility, creating communication barriers for affected individuals. Voice conversion (VC) systems have been developed to address this, yet accurately predicting phonemes in dysarthric speech remains a challenge because of its variability. This study proposes a novel approach that integrates Fuzzy Expectation Maximization (FEM) with diffusion models for enhanced phoneme prediction, aiming to improve the quality of dysarthric voice conversion. The proposed method combines FEM clustering with Diffusion Probabilistic Models (DPM). Diffusion models simulate noise addition and removal to enhance the robustness of speech signals, while FEM iteratively optimizes phoneme boundaries, reducing uncertainty. The system was trained using the Saarland University Voice Disorder dataset, consisting of dysarthric and normal speech samples, with the conversion process represented in the Mel-spectrogram domain. The framework was evaluated with both subjective (Mean Opinion Score, MOS) and objective (Word Error Rate, WER) metrics, complemented by ablation studies. Experimental results showed that the proposed method significantly improved phoneme prediction accuracy and overall voice conversion quality. It achieved higher MOS values for naturalness, intelligibility, and speaker similarity than existing models such as StarGAN-VC and CycleGAN-VC. Additionally, the proposed method achieved a lower WER for both mild and severe dysarthria cases, indicating that it produces more intelligible speech. The integration of FEM with diffusion models offers substantial improvements in handling the irregularities of dysarthric speech. The ablation studies indicate that the method is robust: it maintains speech naturalness and intelligibility even without a speaker encoder. These findings suggest that the proposed approach can contribute to the development of more reliable assistive communication technologies for individuals with dysarthria, providing a promising foundation for future advancements in personalized speech therapy.
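The abstract reports Word Error Rate (WER) as its objective metric but does not say how it was computed. As an illustration only, the sketch below uses the standard definition: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a recognizer's hypothesis, divided by the number of reference words. The function name `word_error_rate`, the whitespace tokenization, and the example sentences are assumptions made for this sketch and are not taken from the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation normalization).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, used only to exercise the function.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A lower value means the recognized text is closer to the reference. Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.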