Kim Jangwon, Kumar Naveen, Tsiartas Andreas, Li Ming, Narayanan Shrikanth S
Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, 3710 McClintock Ave., Los Angeles, CA 90089, USA.
Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, 3710 McClintock Ave., Los Angeles, CA 90089, USA; Department of Electrical Engineering, Computer Science, Linguistics and Psychology, University of Southern California (USC), 3620 McClintock Ave., Los Angeles, CA 90089, USA.
Comput Speech Lang. 2015 Jan;29(1):132-144. doi: 10.1016/j.csl.2014.02.001.
Pathological speech usually refers to speech distortion resulting from atypicalities in voice and/or articulatory mechanisms owing to disease, illness, or other physical or biological insult to the production system. Although automatic evaluation of speech intelligibility and quality could assist experts in diagnosis and treatment design in these scenarios, the many sources and types of variability make it a very challenging computational problem. In this work we propose novel sentence-level features to capture abnormal variation in the prosodic, voice quality, and pronunciation aspects of pathological speech. In addition, we propose a post-classification posterior smoothing scheme that refines the posterior of a test sample based on the posteriors of other test samples. Finally, we perform feature-level fusion and subsystem decision fusion to arrive at a final intelligibility decision. Performance is tested on two pathological speech datasets, the NKI CCRT Speech Corpus (advanced head and neck cancer) and the TORGO database (cerebral palsy or amyotrophic lateral sclerosis), by evaluating classification accuracy with no subject's data shared between the training and test partitions. Results show that the feature sets of the voice quality, prosodic, and pronunciation subsystems each offer significant discriminating power for binary intelligibility classification. We observe that the proposed posterior smoothing in the acoustic space can further reduce classification errors. The smoothed posterior score fusion of subsystems shows the best classification performance (73.5% unweighted and 72.8% weighted average recall over the two classes).
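The abstract does not spell out the smoothing rule or the evaluation code, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes that "smoothing in the acoustic space" can be approximated by blending each test sample's posterior with the mean posterior of its k nearest test-set neighbours (Euclidean distance over feature vectors). The function names and the parameters `k` and `alpha` are hypothetical. The second function computes unweighted average recall (the mean of per-class recalls) and weighted average recall (per-class recalls weighted by class size, which equals overall accuracy).

```python
import numpy as np


def smooth_posteriors(features, posteriors, k=2, alpha=0.5):
    """Blend each test posterior with the mean posterior of its k nearest
    test-set neighbours. ASSUMPTION: a k-NN rule stands in for the paper's
    smoothing scheme, whose details the abstract does not give."""
    features = np.asarray(features, dtype=float)
    posteriors = np.asarray(posteriors, dtype=float)
    # Pairwise squared Euclidean distances between all test samples.
    diff = features[:, None, :] - features[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    neighbour_mean = posteriors[neighbours].mean(axis=1)
    return alpha * posteriors + (1.0 - alpha) * neighbour_mean


def average_recalls(y_true, y_pred):
    """Return (unweighted, weighted) average recall over the classes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    recalls = np.array([(y_pred[y_true == c] == c).mean() for c in classes])
    counts = np.array([(y_true == c).sum() for c in classes])
    unweighted = recalls.mean()
    weighted = (recalls * counts).sum() / counts.sum()
    return unweighted, weighted


# Toy data (made up for illustration): two tight clusters of test utterances,
# each containing one sample whose raw posterior falls on the wrong side of 0.5.
feats = [[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1]]
raw = np.array([0.9, 0.8, 0.4, 0.1, 0.2, 0.6])  # P(intelligible)
labels = np.array([1, 1, 1, 0, 0, 0])

smoothed = smooth_posteriors(feats, raw, k=2, alpha=0.5)
uar_raw, war_raw = average_recalls(labels, (raw >= 0.5).astype(int))
uar_smooth, war_smooth = average_recalls(labels, (smoothed >= 0.5).astype(int))
```

On this toy data the two outlying posteriors are pulled toward their neighbours, so the thresholded decisions change. Whether such smoothing helps on real data depends on how well neighbourhood in the feature space tracks intelligibility, which this sketch does not establish.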