Cross-Institutional Evaluation of Large Language Models for Radiology Diagnosis Extraction: A Prompt-Engineering Perspective.

Author Information

Moassefi Mana, Houshmand Sina, Faghani Shahriar, Chang Peter D, Sun Shawn H, Khosravi Bardia, Tripathi Aakash G, Rasool Ghulam, Bhatia Neil K, Folio Les, Andriole Katherine P, Gichoya Judy W, Erickson Bradley J

Affiliations

Mayo Clinic Artificial Intelligence Lab, Department of Radiology, Mayo Clinic, 200 1st Street, S.W., Rochester, MN, 55905, USA.

Department of Radiology, University of California San Francisco, San Francisco, CA, USA.

Publication Information

J Imaging Inform Med. 2025 May 8. doi: 10.1007/s10278-025-01523-5.

Abstract

The rapid evolution of large language models (LLMs) offers promising opportunities for radiology report annotation, aiding in determining the presence of specific findings. This study evaluates the effectiveness of a human-optimized prompt in labeling radiology reports across multiple institutions using LLMs. Six distinct institutions each collected 500 radiology reports: 100 in each of 5 categories. A standardized Python script was distributed to participating sites, allowing the use of one common, locally executed LLM with a standard human-optimized prompt. The script executed the LLM's analysis for each report and compared predictions to reference labels provided by local investigators. Model performance was measured using accuracy, and results were aggregated centrally. The human-optimized prompt demonstrated high consistency across sites and pathologies. Preliminary analysis indicates significant agreement between the LLM's outputs and the investigator-provided reference labels across multiple institutions. At one site, eight LLMs were systematically compared, with Llama 3.1 70B achieving the highest performance in accurately identifying the specified findings. Comparable performance with Llama 3.1 70B was observed at two additional centers, demonstrating the model's robust adaptability to variations in report structures and institutional practices. Our findings illustrate the potential of optimized prompt engineering in leveraging LLMs for cross-institutional radiology report labeling. This approach is straightforward while maintaining high accuracy and adaptability. Future work will explore model robustness to diverse report structures and further refine prompts to improve generalizability.
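The workflow the abstract describes — a standardized script that pairs a fixed human-optimized prompt with each report, queries a locally executed LLM, and compares its labels against investigator-provided references to compute accuracy — can be sketched as below. This is an illustrative reconstruction, not the study's actual script: the prompt wording, the `query_llm` callable, and all data are hypothetical placeholders (a real deployment would wire `query_llm` to the site's local model server).

```python
# Hypothetical sketch of the cross-institutional labeling protocol.
# PROMPT_TEMPLATE, query_llm, and the toy data are illustrative only;
# they are not taken from the published study.

PROMPT_TEMPLATE = (
    "You are labeling radiology reports. Answer only 'yes' or 'no'.\n"
    "Does the following report describe {finding}?\n\n"
    "Report:\n{report}"
)

def label_report(report: str, finding: str, query_llm) -> str:
    """Combine the standard prompt with one report and normalize
    the model's free-text answer to 'yes' or 'no'."""
    prompt = PROMPT_TEMPLATE.format(finding=finding, report=report)
    answer = query_llm(prompt).strip().lower()
    return "yes" if answer.startswith("y") else "no"

def site_accuracy(reports, references, finding, query_llm) -> float:
    """Fraction of reports where the LLM label matches the
    investigator-provided reference label."""
    preds = [label_report(r, finding, query_llm) for r in reports]
    correct = sum(p == ref for p, ref in zip(preds, references))
    return correct / len(references)

# Toy stand-in for a locally executed LLM: answers based on a keyword,
# so the sketch runs without any model server.
def mock_llm(prompt: str) -> str:
    report = prompt.split("Report:\n", 1)[1]
    return "yes" if "pneumothorax" in report.lower() else "no"

reports = ["Small apical pneumothorax on the right.", "Lungs are clear."]
references = ["yes", "no"]
acc = site_accuracy(reports, references, "pneumothorax", mock_llm)
print(acc)  # 1.0
```

Keeping the prompt and comparison logic inside one distributed script is what makes per-site results comparable: each institution varies only its local reports and reference labels, while the model, prompt, and accuracy computation stay fixed.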

