Using large language models as decision support tools in emergency ophthalmology.

Author information

Kreso Ante, Boban Zvonimir, Kabic Sime, Rada Filip, Batistic Darko, Barun Ivana, Znaor Ljubo, Kumric Marko, Bozic Josko, Vrdoljak Josip

Affiliations

University Hospital Split, Department for Ophthalmology, Croatia.

University of Split School of Medicine, Department for Medical Physics, Croatia.

Publication information

Int J Med Inform. 2025 Jul;199:105886. doi: 10.1016/j.ijmedinf.2025.105886. Epub 2025 Mar 22.

Abstract

BACKGROUND

Large language models (LLMs) have shown promise in various medical applications, but their potential as decision support tools in emergency ophthalmology has not yet been evaluated on real-world cases.

OBJECTIVES

We assessed the performance of state-of-the-art LLMs (GPT-4, GPT-4o, and Llama-3-70b) as decision support tools in emergency ophthalmology compared to human experts.

METHODS

In this prospective comparative study, LLM-generated diagnoses and treatment plans were evaluated against those determined by certified ophthalmologists using 73 anonymized emergency cases from the University Hospital of Split. Two independent expert ophthalmologists graded both LLM and human-generated reports using a 4-point Likert scale.
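The abstract does not describe the prompting setup used to obtain the LLM reports. Purely as an illustration, the sketch below shows one way an anonymized case vignette might be submitted to a chat-completion API; the model name, system prompt, temperature setting, and the get_llm_report helper are assumptions, not the authors' protocol.

```python
# Illustrative sketch only: the study's actual prompts, parameters, and API
# configuration are not given in the abstract. Assumes the OpenAI Python SDK
# (openai>=1.0) with OPENAI_API_KEY set in the environment; Llama-3-70b would
# require a different provider or endpoint.
from openai import OpenAI

client = OpenAI()

def get_llm_report(case_vignette: str, model: str = "gpt-4") -> str:
    """Return a diagnosis and treatment plan for an anonymized emergency case."""
    response = client.chat.completions.create(
        model=model,       # e.g. "gpt-4" or "gpt-4o"
        temperature=0,     # deterministic output for reproducible grading (assumption)
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a decision support assistant for emergency "
                    "ophthalmology. Give the most likely diagnosis and a "
                    "treatment plan for the case described by the user."
                ),
            },
            {"role": "user", "content": case_vignette},
        ],
    )
    return response.choices[0].message.content
```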

RESULTS

Human experts achieved a mean score of 3.72 (SD = 0.50), while GPT-4 scored 3.52 (SD = 0.64) and Llama-3-70b scored 3.48 (SD = 0.48); GPT-4o performed worse, at 3.20 (SD = 0.81). A significant overall difference between human and LLM reports was found (P < 0.001), driven specifically by the difference between human and GPT-4o scores. GPT-4 and Llama-3-70b performed comparably to the ophthalmologists, with no statistically significant differences.
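The abstract reports an overall difference (P < 0.001) without naming the statistical test. As a hypothetical illustration of how per-case Likert grades from the four sources could be compared, the sketch below applies a nonparametric Kruskal-Wallis test with pairwise Mann-Whitney follow-ups from SciPy; this is an assumed analysis, not the one reported in the paper.

```python
# Hypothetical analysis sketch: the abstract does not state which test was used.
# Each argument is a sequence of per-case Likert grades (1-4) for the 73 cases.
from scipy import stats

def compare_report_grades(human, gpt4, gpt4o, llama3):
    # Overall nonparametric comparison across the four rating groups.
    _, p_overall = stats.kruskal(human, gpt4, gpt4o, llama3)
    # Pairwise comparisons of each model against the human experts.
    pairwise = {
        name: stats.mannwhitneyu(human, grades, alternative="two-sided").pvalue
        for name, grades in (("GPT-4", gpt4), ("GPT-4o", gpt4o), ("Llama-3-70b", llama3))
    }
    return p_overall, pairwise
```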

CONCLUSION

Large language models demonstrated accuracy as decision support tools in emergency ophthalmology, with performance comparable to human experts, suggesting potential for integration into clinical practice.
