

Large Language Models for Simplified Interventional Radiology Reports: A Comparative Analysis.

Authors

Can Elif, Uller Wibke, Vogt Katharina, Doppler Michael C, Busch Felix, Bayerl Nadine, Ellmann Stephan, Kader Avan, Elkilany Aboelyazid, Makowski Marcus R, Bressem Keno K, Adams Lisa C

Affiliations

Department of Interventional Radiology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Germany (E.C., W.U., K.V., M.C.D.).


Publication

Acad Radiol. 2025 Feb;32(2):888-898. doi: 10.1016/j.acra.2024.09.041. Epub 2024 Sep 30.

Abstract

PURPOSE

To quantitatively and qualitatively evaluate and compare the performance of leading large language models (LLMs), including proprietary models (GPT-4, GPT-3.5 Turbo, Claude-3-Opus, and Gemini Ultra) and open-source models (Mistral-7b and Mistral-8×7b), in simplifying 109 interventional radiology reports.

METHODS

Qualitative performance was assessed using a five-point Likert scale for accuracy, completeness, clarity, clinical relevance, naturalness, and error rates, including trust-breaking and post-therapy misconduct errors. Quantitative readability was assessed using Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), SMOG Index, and Dale-Chall Readability Score (DCRS). Paired t-tests and Bonferroni-corrected p-values were used for statistical analysis.
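The readability indices named above are closed-form formulas over word, sentence, and syllable counts. As a minimal sketch (not the authors' actual tooling, which the abstract does not name), the Flesch Reading Ease and Flesch-Kincaid Grade Level can be computed like this; the syllable counter is a crude vowel-group heuristic:

```python
import re

def count_syllables(word: str) -> int:
    """Approximate syllables as runs of vowels (rough heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def _counts(text: str):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return max(1, len(words)), sentences, syllables

def flesch_reading_ease(text: str) -> float:
    """FRE = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
    Higher scores mean easier text (60-70 is roughly plain English)."""
    n_words, n_sents, n_syll = _counts(text)
    return 206.835 - 1.015 * (n_words / n_sents) - 84.6 * (n_syll / n_words)

def flesch_kincaid_grade(text: str) -> float:
    """FKGL = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59.
    Result approximates a US school grade level."""
    n_words, n_sents, n_syll = _counts(text)
    return 0.39 * (n_words / n_sents) + 11.8 * (n_syll / n_words) - 15.59
```

Under these formulas, a higher FRE (as reported for GPT-4, median 69.01) corresponds to shorter sentences and fewer syllables per word than a lower one (Claude-3-Opus, median 59.74).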

RESULTS

Qualitative evaluation showed no significant differences between GPT-4 and Claude-3-Opus for any metrics evaluated (all Bonferroni-corrected p-values: p = 1), while they outperformed other assessed models across five qualitative metrics (p < 0.001). GPT-4 had the fewest content and trust-breaking errors, with Claude-3-Opus second. However, all models exhibited some level of trust-breaking and post-therapy misconduct errors, with GPT-4-Turbo and GPT-3.5-Turbo with few-shot prompting showing the lowest error rates, and Mistral-7B and Mistral-8×7B showing the highest. Quantitatively, GPT-4 surpassed Claude-3-Opus in all readability metrics (all p < 0.001), with a median FRE score of 69.01 (IQR: 64.88-73.14) versus 59.74 (IQR: 55.47-64.01) for Claude-3-Opus. GPT-4 also outperformed GPT-3.5-Turbo and Gemini Ultra (both p < 0.001). Inter-rater reliability was strong (κ = 0.77-0.84).
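The two statistical tools reported above are simple to state: a Bonferroni correction multiplies each p-value by the number of comparisons (capped at 1, which is why non-significant comparisons appear as p = 1), and inter-rater reliability via Cohen's kappa is κ = (p_o − p_e)/(1 − p_e). A hedged illustration with invented toy ratings (not study data):

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement.
    Assumes at least some chance disagreement (p_e < 1)."""
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_expected = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
                     for c in categories)
    return (p_observed - p_expected) / (1 - p_expected)
```

For example, with three comparisons a raw p of 0.4 becomes 1.0 after correction, and two raters giving identical Likert scores yield κ = 1.0; the study's κ of 0.77-0.84 indicates substantial to near-perfect agreement on this scale.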

CONCLUSIONS

GPT-4 and Claude-3-Opus demonstrated superior performance in generating simplified IR reports, but the presence of errors across all models, including trust-breaking errors, highlights the need for further refinement and validation before clinical implementation.

CLINICAL RELEVANCE/APPLICATIONS: With the increasing complexity of interventional radiology (IR) procedures and the growing availability of electronic health records, simplifying IR reports is critical to improving patient understanding and clinical decision-making. This study provides insights into the performance of various LLMs in rewriting IR reports, which can help in selecting the most suitable model for clinical patient-centered applications.

