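Development and benchmarking of a Korean audio speech recognition model for clinician-patient conversations in radiation oncology clinics.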

Department of Radiation Oncology, Seoul National University Hospital, South Korea; Department of Radiation Oncology, Seoul National University, South Korea.

Int J Med Inform. 2023 Aug;176:105112. doi: 10.1016/j.ijmedinf.2023.105112. Epub 2023 Jun 1.

BACKGROUND

The purpose of this study is to develop an audio speech recognition (ASR) deep learning model for transcribing clinician-patient conversations in radiation oncology clinics.

METHODS

We fine-tuned the pre-trained English QuartzNet 15x5 model for the Korean language using a publicly available dataset of simulated clinician-patient conversations. Real conversations between a radiation oncologist and 115 patients in actual clinics were then prospectively collected, transcribed, and divided into training (30.26 h) and testing (0.79 h) sets. These datasets were used to develop the ASR model for clinics, which was benchmarked against other ASR models, including 'Whisper large,' the 'Riva Citrinet-1024 Korean model,' and the 'Riva Conformer Korean model.'

RESULTS

The pre-trained English ASR model was successfully fine-tuned to recognize the Korean language, resulting in a character error rate (CER) of 0.17. However, this performance was not sustained on the real conversation dataset. To address this, we further fine-tuned the model, which improved the CER on this dataset to 0.26. Other previously developed ASR models, namely 'Whisper large,' the 'Riva Citrinet-1024 Korean model,' and the 'Riva Conformer Korean model,' showed CERs of 0.31, 0.28, and 0.25, respectively. On the general Korean conversation dataset 'zeroth-korean,' our model showed a CER of 0.44, while 'Whisper large,' the 'Riva Citrinet-1024 Korean model,' and the 'Riva Conformer Korean model' resulted in CERs of 0.26, 0.98, and 0.99, respectively.
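
For readers unfamiliar with the metric, the CER values above can be read as the character-level edit distance between the reference transcript and the model output, divided by the reference length. The following Python sketch illustrates that common definition. It is an assumption, not the authors' evaluation code: the function names `edit_distance` and `cer`, the choice to ignore whitespace, and the use of precomposed Hangul syllables (NFC) as the counting unit are all illustrative, and the study's scoring may have differed on each of these points.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two strings."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str, ignore_spaces: bool = True) -> float:
    """Character error rate: edit distance divided by reference length.

    Assumption: text is normalized to NFC so each Hangul syllable block
    counts as one character; whitespace is dropped when ignore_spaces is True.
    """
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if ignore_spaces:
        reference = "".join(reference.split())
        hypothesis = "".join(hypothesis.split())
    if not reference:
        raise ValueError("reference must contain at least one character")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One substituted syllable out of five reference syllables.
    print(cer("방사선 치료", "방사선 지료"))
```

Under this definition a CER of 0.26 corresponds to roughly one character-level error for every four reference characters, and values close to 1 mean the output shares almost nothing with the reference.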

CONCLUSION

We developed a Korean ASR model to transcribe real conversations between a radiation oncologist and patients. Compared with the other models, its performance was deemed acceptable for both specific and general purposes. We anticipate that this model will reduce the time required for clinicians to document patients' chief complaints or side effects.
