IEEE J Biomed Health Inform. 2020 Oct;24(10):2942-2949. doi: 10.1109/JBHI.2019.2961844. Epub 2019 Dec 25.
Amyotrophic lateral sclerosis (ALS) results in progressive paralysis of voluntary muscles throughout the body. As speech deteriorates, individuals rely on pre-programmed messages available on commercial speech generating devices to communicate using one of the generic electronic voices on the device. To replace these generic voices and restore vocal identity, our aim is to develop personalized voices for people with ALS via the approach of voice conversion. The task is challenging because very few people have large quantities of their premorbid healthy speech recorded. Therefore, we have to rely on small quantities of dysarthric speech concomitant with an individual's disease stage. Further, progressive fatigue prohibits acquisition of large speech datasets and individuals display a range of dysarthria severities resulting from breathing, voice, articulation, resonance, and prosody disturbances. As the first step to address these problems, we use healthy source speakers and propose the approach of combining a structured sparse spectral transform with multiple linear regression-based frequency warping prediction for spectral conversion, and interpolating the transformed spectral frames for speech rate modification. Our experimental data included four healthy source speakers from the ARCTIC dataset, and four target ALS speakers with mild to severe dysarthria, forming 16 speaker pairs. Subjective listening evaluations showed that on average, (i) the proposed approach improved speech intelligibility by about 80% over the target speakers' speech, (ii) the converted voice was 3 times more similar to the target speakers' speech than to the source speakers' speech, and (iii) the converted speech quality was close to the MOS scale "good" relative to the source speakers' speech being "excellent."
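The speech rate modification step mentioned above, interpolating the transformed spectral frames, can be illustrated with a minimal sketch. The abstract does not specify the interpolation scheme, so this example assumes plain linear interpolation along the time axis of a (frames × bins) spectral matrix. The function name `change_speech_rate` and the `rate` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np


def change_speech_rate(frames: np.ndarray, rate: float) -> np.ndarray:
    """Resample a (num_frames, num_bins) spectral sequence along time.

    Assumption: linear interpolation between neighbouring frames.
    rate > 1 shortens the sequence (faster speech);
    rate < 1 lengthens it (slower speech).
    """
    if frames.ndim != 2 or frames.shape[0] < 2:
        raise ValueError("expected an array of shape (T, D) with T >= 2")
    if rate <= 0:
        raise ValueError("rate must be positive")

    num_in = frames.shape[0]
    num_out = max(2, int(round(num_in / rate)))

    # Fractional position in the input sequence for each output frame.
    positions = np.linspace(0.0, num_in - 1, num_out)
    lower = np.floor(positions).astype(int)
    upper = np.minimum(lower + 1, num_in - 1)
    weight = (positions - lower)[:, None]

    # Blend the two neighbouring input frames for every output frame.
    return (1.0 - weight) * frames[lower] + weight * frames[upper]
```

For example, `change_speech_rate(converted_frames, 0.5)` would roughly double the number of frames, which corresponds to halving the speech rate once the frames are resynthesized by a vocoder. Here `converted_frames` stands for any array of spectral frames produced by the conversion stage.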