Google LLC, Mountain View, CA.
MND Association, Northampton, United Kingdom.
J Speech Lang Hear Res. 2024 Nov 7;67(11):4176-4185. doi: 10.1044/2024_JSLHR-24-00045. Epub 2024 Jul 4.
This study examines the effectiveness of automatic speech recognition (ASR) for individuals with speech disorders, focusing on the performance gap between recognition of read speech and recognition of conversational speech. We analyze the factors influencing this gap and the effect of speech mode-specific training on ASR accuracy.
Recordings of read and conversational speech from 27 individuals with various speech disorders were analyzed using both (a) one speaker-independent ASR system trained and optimized for typical speech and (b) multiple ASR models that were personalized to the speech of the participants with disordered speech. Word error rates were calculated for each model, speech mode (read vs. conversational), and participant. Linear mixed-effects models were used to assess the impact of speech mode and disorder severity on ASR accuracy. We investigated nine variables, classified as technical, linguistic, or speech impairment factors, for their potential influence on the performance gap.
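For readers unfamiliar with the metric, word error rate is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between the ASR hypothesis and the reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the study's own scoring pipeline. The function name `word_error_rate` and the text normalization (lowercasing and whitespace tokenization) are assumptions made for illustration; the abstract does not state how transcripts were normalized.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Normalization here (lowercase, split on whitespace) is an assumption
    for illustration; the abstract does not specify the one actually used.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.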
We found a significant performance gap between read and conversational speech in both personalized and unadapted ASR models. Speech impairment severity notably affected recognition accuracy in unadapted models for both speech modes and in personalized models for read speech. Linguistic attributes of utterances had the greatest influence on accuracy, though atypical speech characteristics also played a role. Including conversational speech samples in model training notably improved recognition accuracy.
We observed a significant performance gap in ASR accuracy between read and conversational speech for individuals with speech disorders. This gap was largely due to the linguistic complexity of conversational speech and to the characteristics of disordered speech that appear in it. Training personalized ASR models using conversational speech significantly improved recognition accuracy, demonstrating the importance of domain-specific training and highlighting the need for further research into ASR systems capable of handling disordered conversational speech effectively.