Using large language models as decision support tools in emergency ophthalmology.

Authors

Kreso Ante, Boban Zvonimir, Kabic Sime, Rada Filip, Batistic Darko, Barun Ivana, Znaor Ljubo, Kumric Marko, Bozic Josko, Vrdoljak Josip

Affiliations

University Hospital Split, Department for Ophthalmology, Croatia.

University of Split School of Medicine, Department for Medical Physics, Croatia.

Publication information

Int J Med Inform. 2025 Jul;199:105886. doi: 10.1016/j.ijmedinf.2025.105886. Epub 2025 Mar 22.

DOI: 10.1016/j.ijmedinf.2025.105886
PMID: 40147415

Abstract

BACKGROUND

Large language models (LLMs) have shown promise in various medical applications, but their potential as decision support tools in emergency ophthalmology remains unevaluated using real-world cases.

OBJECTIVES

We assessed the performance of state-of-the-art LLMs (GPT-4, GPT-4o, and Llama-3-70b) as decision support tools in emergency ophthalmology compared to human experts.

METHODS

In this prospective comparative study, LLM-generated diagnoses and treatment plans were evaluated against those determined by certified ophthalmologists using 73 anonymized emergency cases from the University Hospital of Split. Two independent expert ophthalmologists graded both LLM and human-generated reports using a 4-point Likert scale.

RESULTS

Human experts achieved a mean score of 3.72 (SD = 0.50), while GPT-4 scored 3.52 (SD = 0.64) and Llama-3-70b scored 3.48 (SD = 0.48). GPT-4o had lower performance with 3.20 (SD = 0.81). Significant differences were found between human and LLM reports (P < 0.001), specifically between human scores and GPT-4o. GPT-4 and Llama-3-70b showed performance comparable to ophthalmologists, with no statistically significant differences.
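
To make the size of the reported gaps easier to read, the following is a minimal sketch that recomputes the raw and standardized differences between the human mean score and each model's mean score, using only the means and standard deviations quoted above. It is an illustration, not the authors' analysis: the abstract does not name the statistical test used, and the names `REPORTED`, `standardized_gap`, and `gaps_vs_human` are hypothetical. The standardized gap assumes equal group sizes, which the abstract does not state explicitly.

```python
from math import sqrt

# Mean and SD of the 4-point Likert grades, copied from the Results paragraph.
REPORTED = {
    "Human experts": (3.72, 0.50),
    "GPT-4": (3.52, 0.64),
    "Llama-3-70b": (3.48, 0.48),
    "GPT-4o": (3.20, 0.81),
}


def standardized_gap(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d-style gap: mean difference divided by the root mean square
    of the two SDs. ASSUMPTION: both groups have the same number of ratings."""
    pooled_sd = sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd


def gaps_vs_human(scores=REPORTED, baseline="Human experts"):
    """Return {model: (raw_gap, standardized_gap)} relative to the baseline."""
    base_mean, base_sd = scores[baseline]
    result = {}
    for name, (mean, sd) in scores.items():
        if name == baseline:
            continue
        raw = base_mean - mean
        d = standardized_gap(base_mean, base_sd, mean, sd)
        result[name] = (round(raw, 2), round(d, 2))
    return result


if __name__ == "__main__":
    for name, (raw, d) in gaps_vs_human().items():
        print(f"{name}: raw gap = {raw:.2f} points, standardized gap = {d:.2f}")
```

Under these assumptions the gap is smallest for GPT-4 and largest for GPT-4o, which matches the ordering described in the Results. The sketch describes effect sizes only and says nothing about statistical significance.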

CONCLUSION

Large language models demonstrated accuracy as decision support tools in emergency ophthalmology, with performance comparable to human experts, suggesting potential for integration into clinical practice.
