Kim Do Hyung, Jeong Joo Won, Kang Dayoung, Ahn Taekyung, Hong Yeonjung, Im Younggon, Kim Jaewon, Kim Min Jung, Jang Dae-Hyun
Department of Rehabilitation Medicine, Incheon St Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea.
Department of English Language and Literature, Korea University, Seoul, Republic of Korea.
J Med Internet Res. 2025 Jan 14;27:e60520. doi: 10.2196/60520.
Speech sound disorders (SSDs) are common communication challenges in children, typically assessed by speech-language pathologists (SLPs) using standardized tools. However, traditional evaluation methods are time-intensive and prone to variability, raising concerns about reliability.
This study aimed to evaluate the performance of an automatic speech recognition (ASR) model by comparing its evaluation outcomes with those of SLPs on two standardized SSD assessments used in South Korea.
A fine-tuned wav2vec 2.0 XLS-R model, pretrained on 436,000 hours of adult voice data spanning 128 languages, was used. The model was further trained on 93.6 minutes of children's voices with articulation errors to improve error detection. Participants included children referred to the Department of Rehabilitation Medicine at a general hospital in Incheon, South Korea, from August 19, 2022, to June 14, 2023. Two standardized assessments, the Assessment of Phonology and Articulation for Children (APAC) and the Urimal Test of Articulation and Phonology (U-TAP), were used, with ASR transcriptions compared to SLP transcriptions.
This study included 30 children aged 3-7 years who were suspected of having SSDs. The phoneme error rates for the APAC and U-TAP were 8.42% (457/5430) and 8.91% (402/4514), respectively, indicating discrepancies between the ASR model and SLP transcriptions across all phonemes. Consonant error rates were 10.58% (327/3090) and 11.86% (331/2790) for the APAC and U-TAP, respectively. On average, there were 2.60 (SD 1.54) and 3.07 (SD 1.39) discrepancies per child for correctly produced phonemes, and 7.87 (SD 3.66) and 7.57 (SD 4.85) discrepancies per child for incorrectly produced phonemes, based on the APAC and U-TAP, respectively. The correlation between SLPs and the ASR model in terms of the percentage of consonants correct was excellent, with intraclass correlation coefficients of 0.984 (95% CI 0.953-0.994) and 0.978 (95% CI 0.941-0.990) for the APAC and U-TAP, respectively. The z scores between SLPs and ASR showed more pronounced differences with the APAC than the U-TAP, with 8 individuals showing discrepancies in the APAC compared to 2 in the U-TAP.
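The error rates above are simple ratios of discrepant phonemes to total phonemes. As a minimal sketch, the following Python snippet recomputes them from the counts reported in this abstract. The function name `error_rate` and the dictionary layout are illustrative assumptions, not part of the study's code, and the snippet does not reproduce how the discrepancies themselves were identified.

```python
def error_rate(discrepancies: int, total: int) -> float:
    """Return the percentage of phonemes on which ASR and SLP transcriptions differ."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * discrepancies / total


# Counts as reported in the abstract: (discrepant phonemes, total phonemes).
reported = {
    ("APAC", "all phonemes"): (457, 5430),
    ("U-TAP", "all phonemes"): (402, 4514),
    ("APAC", "consonants"): (327, 3090),
    ("U-TAP", "consonants"): (331, 2790),
}

# Round to two decimals to match the precision used in the abstract.
rates = {key: round(error_rate(*counts), 2) for key, counts in reported.items()}

for (assessment, scope), rate in rates.items():
    print(f"{assessment} {scope}: {rate:.2f}%")
```

Running the snippet yields values that match the reported 8.42%, 8.91%, 10.58%, and 11.86%.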
The results demonstrate the potential of the ASR model in assessing children with SSDs. However, its performance varied based on phoneme or word characteristics, highlighting areas for refinement. Future research should include more diverse speech samples and clinical settings to refine the model and ensure broader clinical applicability.