Low Daniel M, Rao Vishwanatha, Randolph Gregory, Song Phillip C, Ghosh Satrajit S
Program in Speech and Hearing Bioscience and Technology, Harvard Medical School, Boston, Massachusetts, United States of America.
McGovern Institute for Brain Research, MIT, Cambridge, Massachusetts, United States of America.
PLOS Digit Health. 2024 May 30;3(5):e0000516. doi: 10.1371/journal.pdig.0000516. eCollection 2024 May.
Detecting voice disorders from voice recordings could allow for frequent, remote, and low-cost screening before costly clinical visits and a more invasive laryngoscopy examination. Our goals were to detect unilateral vocal fold paralysis (UVFP) from voice recordings using machine learning, to identify which acoustic variables were important for prediction in order to increase trust, and to determine model performance relative to clinician performance. Patients with UVFP confirmed by endoscopic examination (N = 77) and controls with normal voices matched for age and sex (N = 77) were included. Voice samples were elicited by reading the Rainbow Passage and sustaining phonation of the vowel "a". Four machine learning models of differing complexity were used, and SHapley Additive exPlanations (SHAP) was used to identify important features. The highest median bootstrapped ROC AUC score was 0.87, exceeding clinicians' performance (range: 0.74-0.81) on the same recordings. However, recording durations differed between UVFP recordings and controls because of how the data were originally processed for storage, and we show that duration alone can classify the two groups. Counterintuitively, many UVFP recordings also had higher intensity than controls, even though UVFP patients tend to have weaker voices, revealing a dataset-specific bias that we mitigate in an additional analysis. We demonstrate that recording biases in audio duration and intensity created dataset-specific differences between patients and controls, which the models exploited to improve classification. Clinicians' ratings provide further evidence that patients were over-projecting their voices and were recorded at a higher signal amplitude than controls. Notably, after matching audio durations and removing variables associated with intensity to mitigate these biases, the models still achieved similarly high performance.
We provide a set of recommendations to avoid bias when building and evaluating machine learning models for screening in laryngology.
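One recommended bias check the abstract describes can be illustrated with a short sketch: if a trivially confounded variable such as recording duration alone can separate patients from controls, the dataset carries a recording bias that any model may exploit. The sketch below is illustrative only (it is not the authors' code); it uses synthetic durations, hypothetical group means, and scikit-learn, and evaluates with a bootstrapped ROC AUC analogous to the paper's evaluation scheme.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 77  # matches the paper's group sizes

# Hypothetical, synthetic durations (seconds): patient files stored
# slightly longer on average, mimicking the storage-processing bias.
dur_patients = rng.normal(12.0, 2.0, n)
dur_controls = rng.normal(9.0, 2.0, n)

X = np.concatenate([dur_patients, dur_controls]).reshape(-1, 1)
y = np.concatenate([np.ones(n), np.zeros(n)])

# A classifier trained on duration ALONE should be near chance (AUC ~0.5)
# in an unbiased dataset; a high AUC here flags a recording bias.
clf = LogisticRegression().fit(X, y)

# Bootstrapped ROC AUC: resample cases with replacement and re-score.
aucs = []
for _ in range(200):
    idx = rng.integers(0, len(y), len(y))
    if len(np.unique(y[idx])) < 2:  # need both classes to compute AUC
        continue
    aucs.append(roc_auc_score(y[idx], clf.predict_proba(X[idx])[:, 1]))

median_auc = float(np.median(aucs))
print(f"median bootstrapped AUC from duration alone: {median_auc:.2f}")
```

With the synthetic separation above, duration alone yields a high AUC, signaling a bias to mitigate (e.g., by trimming all recordings to a common duration) before interpreting model performance.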