
Large language models provide discordant information compared to ophthalmology guidelines.

Author Information

Taloni Andrea, Sangregorio Antonia Carmen, Alessio Giuseppe, Romeo Maria Angela, Coco Giulia, Busin Linda Marie Louise, Sollazzo Andrea, Scorcia Vincenzo, Giannaccare Giuseppe

Affiliations

Department of Translational Medicine, University of Ferrara, Ferrara, Italy.

Department of Ophthalmology, Ospedali Privati Forlì "Villa Igea", Forlì, Italy.

Publication Information

Sci Rep. 2025 Jul 1;15(1):20556. doi: 10.1038/s41598-025-06404-z.

Abstract

To evaluate the agreement of large language models (LLMs) with the Preferred Practice Patterns (PPP) guidelines developed by the American Academy of Ophthalmology (AAO). Open questions based on the AAO PPP were submitted to five LLMs: GPT-o1 and GPT-4o by OpenAI, Claude 3.5 Sonnet by Anthropic, Gemini 1.5 Pro by Google, and DeepSeek-R1-Lite-Preview. Questions were classified as "open" or "confirmatory with positive/negative ground-truth answer". Three blinded investigators classified responses as "concordant", "undetermined", or "discordant" compared to the AAO PPP. Undetermined and discordant answers were analyzed to assess their potential to harm patients. Responses referencing peer-reviewed articles were reported. In total, 147 questions were submitted to the LLMs. Concordant answers were 135 (91.8%) for GPT-o1, 133 (90.5%) for GPT-4o, 136 (92.5%) for Claude 3.5 Sonnet, 124 (84.4%) for Gemini 1.5 Pro, and 119 (81.0%) for DeepSeek-R1-Lite-Preview (P = 0.006). The highest number of harmful answers was reported for Gemini 1.5 Pro (n = 6, 4.1%), followed by DeepSeek-R1-Lite-Preview (n = 5, 3.4%). Gemini 1.5 Pro was the most transparent model (86 references, 58.5%). Other LLMs referenced papers in 9.5-15.6% of their responses. LLMs can provide discordant answers compared to ophthalmology guidelines, potentially harming patients by delaying diagnosis or recommending suboptimal treatments.
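The abstract does not state which statistical test produced P = 0.006, but a Pearson chi-square test on the 2×5 contingency table of concordant vs. non-concordant counts (out of 147 questions per model) reproduces that value. A minimal stdlib-only sketch, assuming this is the test used:

```python
import math

# Concordant-answer counts reported in the abstract (out of 147 questions each)
concordant = {
    "GPT-o1": 135,
    "GPT-4o": 133,
    "Claude 3.5 Sonnet": 136,
    "Gemini 1.5 Pro": 124,
    "DeepSeek-R1-Lite-Preview": 119,
}
n_questions = 147

# Build the 5x2 contingency table: (concordant, not concordant) per model
observed = [(c, n_questions - c) for c in concordant.values()]

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells
col_totals = [sum(row[j] for row in observed) for j in range(2)]
grand_total = sum(col_totals)
chi2 = 0.0
for row in observed:
    row_total = sum(row)
    for j, obs in enumerate(row):
        exp = row_total * col_totals[j] / grand_total
        chi2 += (obs - exp) ** 2 / exp

# Survival function of the chi-square distribution for even df
# (here df = (5 - 1) * (2 - 1) = 4):
#   P(X > x) = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
df = (len(observed) - 1) * (2 - 1)
half = chi2 / 2
p = math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))

print(f"chi2 = {chi2:.2f}, df = {df}, p = {p:.4f}")  # p rounds to 0.006
```

With these counts the statistic is about 14.5 on 4 degrees of freedom, giving p ≈ 0.0058, consistent with the P = 0.006 reported in the abstract.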

