University of Michigan.
J Speech Lang Hear Res. 2024 Nov 7;67(11):4203-4215. doi: 10.1044/2024_JSLHR-24-00070. Epub 2024 Oct 8.
This work introduces updated transcripts, disfluency annotations, and word timings for FluencyBank, which we refer to as FluencyBank Timestamped. This data set will enable the thorough analysis of how speech processing models (such as speech recognition and disfluency detection models) perform when evaluated with typical speech versus speech from people who stutter (PWS).
We update the FluencyBank data set, which includes audio recordings from adults who stutter, to explore the robustness of speech processing models. Our update (semi-automated with manual review) includes new transcripts with timestamps and disfluency labels corresponding to each token in the transcript. Our disfluency labels capture typical disfluencies (filled pauses, repetitions, revisions, and partial words), and we explore how speech model performance compares between Switchboard (typical speech) and FluencyBank Timestamped. We present benchmarks for three speech tasks: intended speech recognition, text-based disfluency detection, and audio-based disfluency detection. For the first task, we evaluate how well Whisper performs at intended speech recognition (i.e., transcribing speech without disfluencies). For the remaining tasks, we evaluate how well a Bidirectional Encoder Representations from Transformers (BERT) text-based model and a Whisper audio-based model perform at disfluency detection. We select BERT and Whisper because they have shown high accuracy on a broad range of tasks in the language and audio domains, respectively.
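To make the annotation scheme concrete, the following sketch shows one plausible shape for a timestamped, disfluency-labeled transcript. The field names and label codes here are illustrative assumptions, not the data set's actual schema; they simply demonstrate how per-token timings and disfluency labels let downstream code recover the intended (disfluency-free) utterance.

```python
# Hypothetical token format for a timestamped, disfluency-labeled transcript.
# Field names and label codes are assumptions for illustration:
#   "FP" = filled pause, "RP" = repetition, "RV" = revision,
#   "PW" = partial word, None = fluent token.
tokens = [
    {"word": "i",     "start": 0.00, "end": 0.12, "label": None},
    {"word": "uh",    "start": 0.12, "end": 0.40, "label": "FP"},
    {"word": "i",     "start": 0.40, "end": 0.52, "label": "RP"},
    {"word": "want",  "start": 0.52, "end": 0.80, "label": None},
    {"word": "w-",    "start": 0.80, "end": 0.95, "label": "PW"},
    {"word": "water", "start": 0.95, "end": 1.30, "label": None},
]

def intended_speech(tokens):
    """Drop disfluency-labeled tokens to recover the intended utterance."""
    return [t["word"] for t in tokens if t["label"] is None]

print(" ".join(intended_speech(tokens)))  # -> "i want water"
```

Because every token carries its own timestamp, the same structure also supports audio-based evaluation: a model's predicted disfluency spans can be aligned against the labeled token intervals.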
For the transcription task, we calculate an intended speech word error rate (isWER) between the model's output and the speaker's intended speech (i.e., speech without disfluencies). We find that isWER is comparable between Switchboard and FluencyBank Timestamped, but that Whisper transcribes filled pauses and partial words at higher rates in the latter data set. Within FluencyBank Timestamped, isWER increases with stuttering severity. For the disfluency detection tasks, we find the models detect filled pauses, revisions, and partial words relatively well in FluencyBank Timestamped, but performance drops substantially for repetitions because the models are unable to generalize to the different types of repetitions (e.g., multiple repetitions and sound repetitions) produced by PWS. We hope that FluencyBank Timestamped will allow researchers to explore closing performance gaps between typical speech and speech from PWS.
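The isWER metric described above can be sketched as an ordinary word error rate computed against a disfluency-free reference. The implementation below is a minimal, self-contained Levenshtein-based WER; the example strings are invented for illustration and do not come from the data set.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution

    return dp[len(ref)][len(hyp)] / len(ref)

# isWER scores the model output against the intended (disfluency-free) speech,
# so disfluencies that the model transcribes verbatim count as insertions.
intended = "i want water"
model_output = "i uh i want water"  # filled pause and repetition transcribed
print(word_error_rate(intended, model_output))  # -> 2/3 (two insertions)
```

Under this formulation, a model that faithfully transcribes every filled pause and repetition is penalized relative to the intended speech, which is consistent with the reported finding that Whisper's verbatim transcription of disfluencies raises isWER on FluencyBank Timestamped.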
Our analysis shows that there are gaps in speech recognition and disfluency detection performance between typical speech and speech from PWS. We hope that FluencyBank Timestamped will contribute to more advancements in training robust speech processing models.