Comparison of Large Language Models in Diagnosis and Management of Challenging Clinical Cases.

Authors

Shanmugam Sujeeth Krishna, Browning David J

Affiliation

Department of Ophthalmology, Wake Forest University School of Medicine, Winston-Salem, NC, USA.

Publication

Clin Ophthalmol. 2024 Nov 12;18:3239-3247. doi: 10.2147/OPTH.S488232. eCollection 2024.

DOI:10.2147/OPTH.S488232

PMID:39555212

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11568767/

Abstract

PURPOSE

Compare large language models (LLMs) in analyzing and responding to a difficult series of ophthalmic cases.

DESIGN

A comparative case series in which LLMs that met inclusion criteria were tested on twenty difficult case studies posed in open-text format.

METHODS

Fifteen LLMs accessible to ophthalmologists were tested against twenty case studies published in JAMA Ophthalmology. Each case was presented in identical, open-ended text fashion to each LLM and open-ended responses regarding differential diagnosis, next diagnostic tests and recommended treatments were requested. Responses were recorded and assessed for accuracy against published correct answers. The main outcome was accuracy of LLMs against the correct answers. Secondary outcomes included comparative performance on the differential diagnosis, ancillary testing, and treatment subtests; and readability of responses.

RESULTS

Scores were normally distributed and ranged from 0 to 35 (out of a maximum score of 60), with a mean ± standard deviation of 19 ± 9. Scores for three of the LLMs (ChatGPT 3.5, Claude Pro, and Copilot Pro) were statistically significantly higher than the mean. Two of the high-performing LLMs were paid subscriptions (Claude Pro and Copilot Pro) and one was free (ChatGPT 3.5). While there were no clinical or statistical differences between ChatGPT 3.5 and Claude Pro, there was a separation of +5 points, or 0.56 standard deviations, between Copilot Pro and the other highly ranked LLMs. The reading level of responses from all tested programs was above the eighth-grade level that the American Medical Association (AMA) recommends for public consumers.
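The arithmetic behind the reported separation can be checked with a short sketch. It uses only the summary figures stated above (mean 19, standard deviation 9, maximum 60, observed top score 35, gap of 5 points); the function names are illustrative and are not taken from the study.

```python
# Summary statistics as reported in the abstract.
MEAN_SCORE = 19.0
SD_SCORE = 9.0
MAX_SCORE = 60.0


def in_sd_units(point_gap, sd=SD_SCORE):
    """Express a gap in raw points as a multiple of the standard deviation."""
    return point_gap / sd


def percent_of_max(score, max_score=MAX_SCORE):
    """Express a raw score as a percentage of the maximum possible score."""
    return 100.0 * score / max_score


print(round(in_sd_units(5), 2))              # 0.56
print(round(percent_of_max(MEAN_SCORE), 1))  # 31.7
print(round(percent_of_max(35), 1))          # 58.3
```

On these figures, the mean score corresponds to roughly a third of the available points, and the highest observed score to a little under 60 percent.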

CONCLUSION

Subscription LLMs were more prevalent among highly ranked LLMs, suggesting that these perform better as ophthalmic assistants. While readability was poor for the average person, the content was understood by a board-certified ophthalmologist. The accuracy of LLMs is not high enough to recommend them for patient care in standalone mode, but their use in aiding clinicians with patient care and preventing oversights is promising.


