Ng Si-Ioi, Xu Lingfeng, Mueller Kimberly D, Liss Julie, Berisha Visar
Arizona State University, USA.
University of Wisconsin-Madison, USA.
Interspeech. 2024 Sep;2024:917-921. doi: 10.21437/interspeech.2024-2063.
Speech foundation models are remarkably successful in various consumer applications, prompting their extension to clinical use-cases. This extension is challenged by the small size of clinical datasets, which precludes effective fine-tuning. We tested the efficacy of two models to classify participants using segmental (Wav2Vec2.0, W2V2) and suprasegmental (Trillsson) speech analysis windows. Analysis at both time scales has shown differences in the context of cognitive decline. Speakers were classified as healthy controls (HC), Amyloid-β+ (Aβ+), mild cognitive impairment (MCI), or dementia. A subset of W2V2 and Trillsson representations showed large effect sizes between HC and each risk factor. Cross-validation showed that W2V2 consistently outperformed Trillsson. Mean macro-F1 scores of 54.1%, 63.5%, and 72.0% were found for classifying Aβ+, MCI, and dementia, respectively, from HC. Repeatability analysis of Trillsson and W2V2 showed intraclass correlations of 0.30 and 0.41, respectively. Reliability of such models must be enhanced for clinical speech analysis and longitudinal tracking.
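The two summary metrics reported in the abstract, macro-F1 and the intraclass correlation (ICC), can be illustrated with a minimal sketch. The abstract does not state which ICC form or which software the authors used, so the one-way random-effects ICC(1,1) below is an assumption chosen for simplicity, and the function names `macro_f1` and `icc_oneway` and the toy labels are hypothetical.

```python
from statistics import mean


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def icc_oneway(ratings):
    """ICC(1,1): one-way random effects, single measurement (assumed form).

    ratings: one list per speaker, each holding k repeated measurements
    of the same feature.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    # Between-speaker and within-speaker mean squares from one-way ANOVA.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Toy binary task (HC vs. MCI); the labels are illustrative, not study data.
truth = ["HC", "HC", "MCI", "MCI"]
preds = ["HC", "MCI", "MCI", "MCI"]
f1 = macro_f1(truth, preds)

# Perfectly repeatable feature across two sessions for three speakers.
icc = icc_oneway([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
```

Macro-F1 weights each class equally regardless of its size, which matters when the clinical groups are much smaller than the control group. An ICC near 1 means a representation varies mostly between speakers rather than between repeated recordings of the same speaker.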