
Leveraging Professional Radiologists' Expertise to Enhance LLMs' Evaluation for AI-generated Radiology Reports.

Author Information

Zhu Qingqing, Chen Xiuying, Jin Qiao, Hou Benjamin, Mathai Tejas Sudharshan, Mukherjee Pritam, Gao Xin, Summers Ronald M, Lu Zhiyong

Affiliations

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Bioscience Research Center, King Abdullah University of Science & Technology, Saudi Arabia.

Publication Information

Proc (IEEE Int Conf Healthc Inform). 2024 Jun;2024:402-411. doi: 10.1109/ichi61247.2024.00058. Epub 2024 Aug 22.

Abstract

In radiology, Artificial Intelligence (AI) has significantly advanced report generation, but automatic evaluation of these AI-produced reports remains challenging. Current metrics, such as Conventional Natural Language Generation (NLG) and Clinical Efficacy (CE), often fall short in capturing the semantic intricacies of clinical contexts or overemphasize clinical details, undermining report clarity. To overcome these issues, our proposed method synergizes the expertise of professional radiologists with Large Language Models (LLMs), like GPT-3.5 and GPT-4. Utilizing In-Context Instruction Learning (ICIL) and Chain of Thought (CoT) reasoning, our approach aligns LLM evaluations with radiologist standards, enabling detailed comparisons between human and AI-generated reports. This is further enhanced by a regression model that aggregates sentence-level evaluation scores. Experimental results show that our "Detailed GPT-4 (5-shot)" model achieves a correlation of 0.48, outperforming the METEOR metric by 0.19, while our "Regressed GPT-4" model shows even greater alignment (0.64) with expert evaluations, exceeding the best existing metric by a margin of 0.35. Moreover, the robustness of our explanations has been validated through a thorough iterative strategy. We plan to publicly release annotations from radiology experts, setting a new standard for accuracy in future assessments. This underscores the potential of our approach in enhancing the quality assessment of AI-driven medical reports.


Similar Articles

GPT-Driven Radiology Report Generation with Fine-Tuned Llama 3.
Bioengineering (Basel). 2024 Oct 18;11(10):1043. doi: 10.3390/bioengineering11101043.

References Cited in This Article

Utilizing Longitudinal Chest X-Rays and Reports to Pre-fill Radiology Reports.
Med Image Comput Comput Assist Interv. 2023 Oct;14224:189-198. doi: 10.1007/978-3-031-43904-9_19. Epub 2023 Oct 1.
Deep Learning to Classify Radiology Free-Text Reports.
Radiology. 2018 Mar;286(3):845-852. doi: 10.1148/radiol.2017171115. Epub 2017 Nov 13.
