Department of Health Services Research, CAPHRI Care and Public Health Research Institute, Faculty of Health Medicine and Life Sciences, Maastricht University, Maastricht, The Netherlands.
The Living Lab in Ageing & Long-Term Care, Maastricht, The Netherlands.
J Am Med Inform Assoc. 2023 Feb 16;30(3):411-417. doi: 10.1093/jamia/ocac241.
In long-term care (LTC) for older adults, interviews are used to collect client perspectives and are often recorded and transcribed verbatim, a time-consuming, tedious task. Automatic speech recognition (ASR) could provide a solution; however, current ASR systems are not effective for certain demographic groups. This study aims to show how data from specific groups, such as older adults or people with accents, can be used to develop an effective ASR model.
An initial ASR model was developed using the Mozilla Common Voice dataset. Audio and verbatim transcript data (34 h) from interviews with residents, family, and care professionals on quality of care were then used to adapt the model, with interview data processed iteratively to reduce the word error rate (WER).
Due to background noise and mispronunciations, the initial ASR model had a WER of 48.3% on interview data. After fine-tuning on interview data, the average WER was reduced to 24.3%. When tested on held-out speech data from the interviews, a median WER of 22.1% was achieved, with residents displaying the highest WER (22.7%). The resulting ASR model was at least 6 times faster than manual transcription.
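For reference, the WER figures reported above are conventionally computed as (substitutions + deletions + insertions) divided by the number of reference words, i.e., a word-level edit distance. A minimal sketch (not the authors' implementation) in Python:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # DP table: d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# Example: one substitution and one deletion over a 4-word reference -> 0.5
print(wer("the resident was satisfied", "the resident is"))
```

In practice, transcripts are normalized (lowercasing, punctuation removal) before scoring, since formatting differences would otherwise inflate the WER.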
The current method decreased the WER substantially, verifying its efficacy. Moreover, transcribing audio locally can benefit the privacy of participants.
The current study shows that interview data from LTC for older adults can be used effectively to improve an ASR model. While the model output still contains some errors, researchers reported that it saved considerable time during transcription.