
Generative Large Language Models for Detection of Speech Recognition Errors in Radiology Reports.

Author Affiliations

From the Department of Medical Imaging, Western Health, Footscray, Australia (R.A.S., L.L., W.L.); Alfred Health, Harrison.ai, Monash University, Clayton, Australia (J.C.Y.S.); Department of Surgery, Western Precinct, University of Melbourne, Melbourne, Australia (K.C., J.Y.); and Department of Surgery, Western Health, Melbourne, Australia (J.Y.).

Publication Information

Radiol Artif Intell. 2024 Mar;6(2):e230205. doi: 10.1148/ryai.230205.

Abstract

This study evaluated the ability of generative large language models (LLMs) to detect speech recognition errors in radiology reports. A dataset of 3233 CT and MRI reports was assessed by radiologists for speech recognition errors, which were categorized as clinically significant or not clinically significant. The performance of five generative LLMs (GPT-3.5-turbo, GPT-4, text-davinci-003, Llama-v2-70B-chat, and Bard) in detecting these errors was compared, using manual error detection as the reference standard; prompt engineering was used to optimize model performance. GPT-4 demonstrated high accuracy in detecting both clinically significant errors (precision, 76.9%; recall, 100%; F1 score, 86.9%) and not clinically significant errors (precision, 93.9%; recall, 94.7%; F1 score, 94.3%). Text-davinci-003 achieved F1 scores of 72% and 46.6% for clinically significant and not clinically significant errors, respectively; GPT-3.5-turbo achieved 59.1% and 32.2%, and Llama-v2-70B-chat achieved 72.8% and 47.7%. Bard showed the lowest accuracy, with F1 scores of 47.5% and 20.9%. GPT-4 also effectively identified challenging errors such as nonsense phrases and internally inconsistent statements. Longer reports, resident dictation, and overnight shifts were associated with higher error rates. In conclusion, advanced generative LLMs show potential for automatic detection of speech recognition errors in radiology reports.

Keywords: CT, Large Language Model, Machine Learning, MRI, Natural Language Processing, Radiology Reports, Speech, Unsupervised Learning
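To make the evaluation protocol concrete, the sketch below outlines the two steps the abstract describes: prompt a generative LLM to label each report for dictation errors, then score its labels against the radiologists' manual reference standard with per-class precision, recall, and F1. This is a minimal illustration assuming the OpenAI Python client (openai>=1.0); the prompt wording, label names, and helper functions are hypothetical assumptions for illustration, not the authors' actual prompts or pipeline.

```python
# Minimal sketch of the workflow described in the abstract:
# (1) ask a generative LLM whether a report contains speech recognition
#     errors, (2) score the model's labels against the radiologist
#     (manual) reference standard with precision, recall, and F1.
# The prompt text and label names below are illustrative assumptions.

from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical instruction in the spirit of the paper's prompt engineering:
# define the task, the error taxonomy, and a constrained output format.
SYSTEM_PROMPT = (
    "You are a radiology report quality checker. Identify speech "
    "recognition (dictation) errors such as nonsense phrases or "
    "internally inconsistent statements. Answer with exactly one label: "
    "'clinically_significant', 'not_clinically_significant', or 'no_error'."
)

def classify_report(report_text: str, model: str = "gpt-4") -> str:
    """Return the model's error label for one report."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output for evaluation
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": report_text},
        ],
    )
    return response.choices[0].message.content.strip()

def precision_recall_f1(predicted: list[str], reference: list[str], label: str):
    """Score one error class against the manual reference standard."""
    tp = sum(p == label == r for p, r in zip(predicted, reference))
    fp = sum(p == label != r for p, r in zip(predicted, reference))
    fn = sum(r == label != p for p, r in zip(predicted, reference))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

As a consistency check on the reported figures, GPT-4's clinically significant results give F1 = 2 × 0.769 × 1.00 / (0.769 + 1.00) ≈ 0.869, matching the stated F1 score of 86.9%.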


Similar Articles

1
Generative Large Language Models Trained for Detecting Errors in Radiology Reports.
Radiology. 2025 May;315(2):e242575. doi: 10.1148/radiol.242575.
2
Multilingual feasibility of GPT-4o for automated Voice-to-Text CT and MRI report transcription.
Eur J Radiol. 2025 Jan;182:111827. doi: 10.1016/j.ejrad.2024.111827. Epub 2024 Nov 17.

Cited By

1
From large language models to multimodal AI: a scoping review on the potential of generative AI in medicine.
Biomed Eng Lett. 2025 Aug 22;15(5):845-863. doi: 10.1007/s13534-025-00497-1. eCollection 2025 Sep.
2
In-Context Learning with Large Language Models: A Simple and Effective Approach to Improve Radiology Report Labeling.
Healthc Inform Res. 2025 Jul;31(3):295-309. doi: 10.4258/hir.2025.31.3.295. Epub 2025 Jul 31.
3
Evaluating prompt and data perturbation sensitivity in large language models for radiology reports classification.
JAMIA Open. 2025 Aug 12;8(4):ooaf073. doi: 10.1093/jamiaopen/ooaf073. eCollection 2025 Aug.
4
RadioRAG: Online Retrieval-augmented Generation for Radiology Question Answering.
Radiol Artif Intell. 2025 Jun 18:e240476. doi: 10.1148/ryai.240476.
5
Large Language Models in Cancer Imaging: Applications and Future Perspectives.
J Clin Med. 2025 May 8;14(10):3285. doi: 10.3390/jcm14103285.
6
Generative Large Language Models Trained for Detecting Errors in Radiology Reports.
Radiology. 2025 May;315(2):e242575. doi: 10.1148/radiol.242575.
7
Integrating artificial intelligence into radiological cancer imaging: from diagnosis and treatment response to prognosis.
Cancer Biol Med. 2025 Feb 4;22(1):6-13. doi: 10.20892/j.issn.2095-3941.2024.0422.
8
Foundation Models in Radiology: What, How, Why, and Why Not.
Radiology. 2025 Feb;314(2):e240597. doi: 10.1148/radiol.240597.

References

1
Evaluating progress in automatic chest X-ray radiology report generation.
Patterns (N Y). 2023 Aug 3;4(9):100802. doi: 10.1016/j.patter.2023.100802. eCollection 2023 Sep 8.
2
Evaluating GPT4 on Impressions Generation in Radiology Reports.
Radiology. 2023 Jun;307(5):e231259. doi: 10.1148/radiol.231259.
3
Evaluating GPT as an Adjunct for Radiologic Decision Making: GPT-4 Versus GPT-3.5 in a Breast Imaging Pilot.
J Am Coll Radiol. 2023 Oct;20(10):990-997. doi: 10.1016/j.jacr.2023.05.003. Epub 2023 Jun 21.
4
Automated deidentification of radiology reports combining transformer and "hide in plain sight" rule-based methods.
J Am Med Inform Assoc. 2023 Jan 18;30(2):318-328. doi: 10.1093/jamia/ocac219.
5
Application of a Domain-specific BERT for Detection of Speech Recognition Errors in Radiology Reports.
Radiol Artif Intell. 2022 May 25;4(4):e210185. doi: 10.1148/ryai.210185. eCollection 2022 Jul.
6
How to Create a Great Radiology Report.
Radiographics. 2020 Oct;40(6):1658-1670. doi: 10.1148/rg.2020200020.
7
Trainee radiologist reports as a source of confirmation bias in radiology.
Clin Radiol. 2018 Dec;73(12):1052-1055. doi: 10.1016/j.crad.2018.08.003. Epub 2018 Sep 13.
