
Evaluation of large language models in generating pulmonary nodule follow-up recommendations.

Author Information

Wen Junzhe, Huang Wanyue, Yan Huzheng, Sun Jie, Dong Mengshi, Li Chao, Qin Jie

Affiliations

Department of Radiology, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, China.

Department of Interventional Radiology, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, China.

Publication Information

Eur J Radiol Open. 2025 Apr 30;14:100655. doi: 10.1016/j.ejro.2025.100655. eCollection 2025 Jun.

Abstract

RATIONALE AND OBJECTIVES

To evaluate the performance of large language models (LLMs) in generating clinical follow-up recommendations for pulmonary nodules by leveraging radiological report findings and management guidelines.

MATERIALS AND METHODS

This retrospective study included CT follow-up reports of pulmonary nodules documented by senior radiologists from September 1, 2023, to April 30, 2024. An additional sixty reports were collected for prompt engineering based on few-shot learning and chain-of-thought methodology. The radiological findings of the pulmonary nodules, along with the final prompt, were input into GPT-4o-mini or ERNIE-4.0-Turbo-8K to generate follow-up recommendations. The AI-generated recommendations were evaluated against radiologist-defined, guideline-based standards through binary classification, assessing nodule risk classification, follow-up intervals, and harmfulness. Performance metrics included sensitivity, specificity, positive/negative predictive values, and F1 score.
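The prompting strategy described above (few-shot examples plus a chain-of-thought instruction, with the report findings appended as the final query) can be sketched as follows. This is a hypothetical reconstruction, not the study's actual prompt: the function name, the system-instruction wording, the reference to Fleischner-style guidelines, and the example reports are all illustrative assumptions.

```python
def build_followup_prompt(findings, few_shot_examples):
    """Assemble a few-shot, chain-of-thought chat prompt for nodule
    follow-up recommendations (hypothetical sketch)."""
    # System instruction: ground the model in management guidelines and
    # ask it to reason step by step (chain of thought) before answering.
    messages = [{
        "role": "system",
        "content": (
            "You are a radiology assistant. Using Fleischner-style "
            "management guidelines, classify the pulmonary nodule risk "
            "and recommend a follow-up interval. Reason step by step "
            "before giving your final recommendation."
        ),
    }]
    # Few-shot examples: each curated report paired with the
    # guideline-based recommendation a radiologist would give.
    for ex in few_shot_examples:
        messages.append({"role": "user", "content": ex["findings"]})
        messages.append({"role": "assistant", "content": ex["recommendation"]})
    # The new report's findings form the final user query.
    messages.append({"role": "user", "content": findings})
    return messages

demo = build_followup_prompt(
    "Solid nodule, 7 mm, right upper lobe; low-risk patient.",
    [{"findings": "Ground-glass nodule, 5 mm, left lower lobe.",
      "recommendation": "Low risk. No routine follow-up required."}],
)
print(len(demo))  # system message + one example pair + the query
```

The resulting message list can be passed to either model's chat endpoint; only the prompt assembly is shown here, since API access details are outside the abstract's scope.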

RESULTS

Among 1009 reports from 996 patients (median age, 50.0 years; IQR, 39.0-60.0 years; 511 male patients), ERNIE-4.0-Turbo-8K and GPT-4o-mini demonstrated comparable performance in both accuracy of follow-up recommendations (94.6 % vs 92.8 %, P = 0.07) and harmfulness rates (2.9 % vs 3.5 %, P = 0.48). In nodule classification, ERNIE-4.0-Turbo-8K and GPT-4o-mini performed similarly, with accuracy of 99.8 % vs 99.9 %, sensitivity of 96.9 % vs 100.0 %, specificity of 99.9 % vs 99.9 %, positive predictive value of 96.9 % vs 96.9 %, negative predictive value of 100.0 % vs 99.9 %, and F1 score of 96.9 % vs 98.4 %, respectively.
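The reported metrics all derive from a binary confusion matrix (high-risk vs low-risk nodule classification). A minimal sketch of the computation is below; the counts are hypothetical, chosen only to roughly reproduce the ERNIE-4.0-Turbo-8K figures, and are not the study's actual tallies.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall on truly high-risk nodules
    specificity = tn / (tn + fp)   # correct rejection of low-risk nodules
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    npv = tn / (tn + fn)           # negative predictive value
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "npv": npv, "f1": f1}

# Illustrative counts over 1009 reports (assumed, not from the paper):
m = diagnostic_metrics(tp=31, fp=1, tn=976, fn=1)
print({k: round(v, 3) for k, v in m.items()})
```

With these assumed counts, sensitivity, PPV, and F1 all come out to 31/32 ≈ 96.9 %, matching the pattern of the reported values.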

CONCLUSION

LLMs show promise in providing guideline-based follow-up recommendations for pulmonary nodules, but require rigorous validation and supervision to mitigate potential clinical risks. This study offers insights into their potential role in automated radiological decision support.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d0a6/12088779/af6e899b602f/gr1.jpg
