Al Futaisi Najla D, Schuller Björn W, Ringeval Fabien, Pantic Maja
GLAM - Group on Language, Audio & Music, Imperial College London, London, United Kingdom.
Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany.
Front Digit Health. 2025 Aug 18;7:1274675. doi: 10.3389/fdgth.2025.1274675. eCollection 2025.
Early detection is crucial for managing incurable disorders, particularly autism spectrum disorder (ASD). Unfortunately, a considerable number of individuals with ASD receive a late diagnosis or remain undiagnosed. Speech plays a critical role in ASD, as a significant number of affected individuals experience speech impairments or remain non-verbal. To address this, we use speech analysis for automatic ASD recognition in children by classifying their speech as either autistic or typically developing. However, due to the lack of large labelled datasets, we leverage two smaller datasets to explore deep transfer learning methods. We investigate two fine-tuning approaches: (1) Discriminative Fine-Tuning (D-FT), in which a model pre-trained on a related dataset is fine-tuned on a similar task, and (2) Wav2Vec 2.0 Fine-Tuning (W2V2-FT), which leverages self-supervised speech representations pre-trained on a larger, unrelated dataset. We perform two distinct classification tasks: (a) a binary typicality task, classifying speech as that of either a typically developing (TD) child or an atypically developing (AD) child; and (b) a four-class diagnosis task, which further classifies atypical cases into ASD, dysphasia (DYS), or pervasive developmental disorder-not otherwise specified (NOS), alongside TD. This research aims to improve early recognition strategies, particularly for individuals with ASD. The findings suggest that transfer learning methods can be a valuable tool for autism recognition from speech. For the typicality classification task (TD vs. AD), the D-FT model achieved the highest test UAR (94.8%), outperforming W2V2-FT (91.5%). In the diagnosis task (TD, ASD, DYS, NOS), D-FT also demonstrated superior performance (60.9% UAR) compared to W2V2-FT (54.3%). These results highlight the potential of transfer learning for speech-based ASD recognition and underscore the challenges of multi-class classification with limited labelled data.
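The abstract reports results as unweighted average recall (UAR), i.e. the mean of per-class recalls, a standard metric for imbalanced speech-classification tasks because it weights every class equally regardless of its frequency. As a minimal illustration (the class labels and toy predictions below are hypothetical, not data from the paper):

```python
# UAR (unweighted average recall) = mean of per-class recalls.
# Equivalent to macro-averaged recall / balanced accuracy.
from collections import defaultdict

def uar(y_true, y_pred):
    correct = defaultdict(int)  # correctly predicted samples per class
    total = defaultdict(int)    # true samples per class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Toy imbalanced example for the binary typicality task (TD vs. AD):
# 8 TD samples all correct, 2 AD samples with one missed.
y_true = ["TD"] * 8 + ["AD"] * 2
y_pred = ["TD"] * 8 + ["AD", "TD"]
print(round(uar(y_true, y_pred), 3))  # recall(TD)=1.0, recall(AD)=0.5 -> 0.75
```

Note that plain accuracy on this toy example would be 90%, while UAR is 75%; the gap shows why UAR is the more honest metric when, as here, one class dominates the data.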