From the Department of Medical Imaging, Western Health, Footscray, Australia (R.A.S., L.L., W.L.); Alfred Health, Harrison.ai, Monash University, Clayton, Australia (J.C.Y.S.); Department of Surgery, Western Precinct, University of Melbourne, Melbourne, Australia (K.C., J.Y.); and Department of Surgery, Western Health, Melbourne, Australia (J.Y.).
Radiol Artif Intell. 2024 Mar;6(2):e230205. doi: 10.1148/ryai.230205.
This study evaluated the ability of generative large language models (LLMs) to detect speech recognition errors in radiology reports. A dataset of 3233 CT and MRI reports was assessed by radiologists for speech recognition errors, which were categorized as clinically significant or not clinically significant. The performance of five generative LLMs (GPT-3.5-turbo, GPT-4, text-davinci-003, Llama-v2-70B-chat, and Bard) in detecting these errors was compared, with manual error detection serving as the reference standard. Prompt engineering was used to optimize model performance. GPT-4 demonstrated high accuracy in detecting clinically significant errors (precision, 76.9%; recall, 100%; F1 score, 86.9%) and not clinically significant errors (precision, 93.9%; recall, 94.7%; F1 score, 94.3%). Text-davinci-003 achieved F1 scores of 72% and 46.6% for clinically significant and not clinically significant errors, respectively; GPT-3.5-turbo achieved 59.1% and 32.2%, and Llama-v2-70B-chat achieved 72.8% and 47.7%. Bard showed the lowest accuracy, with F1 scores of 47.5% and 20.9%. GPT-4 effectively identified challenging errors such as nonsense phrases and internally inconsistent statements. Longer reports, resident dictation, and overnight shifts were associated with higher error rates. In conclusion, advanced generative LLMs show potential for automatic detection of speech recognition errors in radiology reports.

Keywords: CT, Large Language Model, Machine Learning, MRI, Natural Language Processing, Radiology Reports, Speech, Unsupervised Learning.
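The abstract does not disclose the study's actual prompts or evaluation code. The following is a minimal sketch, assuming the OpenAI Python client (>=1.0) and scikit-learn, of how a report could be screened for speech recognition errors with a GPT-4-class model and how precision, recall, and F1 score could then be computed against manual radiologist labels. The prompt wording, function names, and error-labeling scheme are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the study's actual code): prompt GPT-4 to flag
# speech recognition errors in a radiology report, then score report-level
# predictions against manual radiologist labels (1 = contains error).
# Assumes the openai (>=1.0) and scikit-learn packages are installed and
# OPENAI_API_KEY is set; the prompt text below is a hypothetical example.
from openai import OpenAI
from sklearn.metrics import precision_score, recall_score, f1_score

client = OpenAI()

PROMPT = (
    "You are reviewing a radiology report for speech recognition "
    "(dictation) errors such as nonsense phrases or internally "
    "inconsistent statements. Answer 'ERROR' if the report contains "
    "at least one such error, otherwise answer 'NO ERROR'.\n\n"
    "Report:\n{report}"
)

def detect_error(report_text: str) -> bool:
    """Return True if the model flags a speech recognition error."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic output for evaluation
        messages=[{"role": "user",
                   "content": PROMPT.format(report=report_text)}],
    )
    return "NO ERROR" not in response.choices[0].message.content.upper()

def evaluate(reports: list[str], manual_labels: list[int]) -> dict:
    """Compare model predictions with the manual reference standard."""
    predictions = [int(detect_error(r)) for r in reports]
    return {
        "precision": precision_score(manual_labels, predictions),
        "recall": recall_score(manual_labels, predictions),
        "f1": f1_score(manual_labels, predictions),
    }
```

In the study, errors were additionally split into clinically significant and not clinically significant categories; a sketch like the one above would be run separately per category to reproduce the reported per-category precision, recall, and F1 values.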