Benchmarking the performance of large language models in uveitis: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Anthropic Claude3.

Authors

Zhao Fang-Fang, He Han-Jie, Liang Jia-Jian, Cen Jingyun, Wang Yun, Lin Hongjie, Chen Feifei, Li Tai-Ping, Yang Jian-Feng, Chen Lan, Cen Ling-Ping

Affiliations

Joint Shantou International Eye Center of Shantou University and The Chinese University of Hong Kong, Shantou, Guangdong, China.

Shantou University Medical College, Shantou, Guangdong, China.

Publication information

Eye (Lond). 2025 Apr;39(6):1132-1137. doi: 10.1038/s41433-024-03545-9. Epub 2024 Dec 17.

DOI:10.1038/s41433-024-03545-9

PMID:39690303

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11978972/

Abstract

BACKGROUND/OBJECTIVE

This study aimed to evaluate the accuracy, comprehensiveness, and readability of responses generated by various Large Language Models (LLMs) (ChatGPT-3.5, Gemini, Claude 3, and GPT-4.0) in the clinical context of uveitis, utilizing a meticulous grading methodology.

METHODS

Twenty-seven clinical uveitis questions were presented individually to four LLMs: ChatGPT (versions GPT-3.5 and GPT-4.0), Google Gemini, and Claude 3. Three experienced uveitis specialists independently assessed the responses for accuracy using a three-point scale across three rounds with a 48-hour wash-out interval. The final accuracy rating for each LLM response ('Excellent', 'Marginal', or 'Deficient') was determined through a majority consensus approach. Comprehensiveness was evaluated on a three-point scale, but only for responses rated 'Excellent' in the final accuracy assessment. Readability was determined using the Flesch-Kincaid Grade Level formula. Statistical analyses were conducted to identify significant differences among the LLMs, with a significance threshold of p < 0.05.
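The two scoring steps named above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the abstract does not say how ratings were pooled across graders and rounds (the sketch simply takes a flat list of ratings and requires a strict majority), and the syllable counter is a crude vowel-group heuristic rather than whatever tool the study used. The Flesch-Kincaid Grade Level formula itself is the standard one: 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59.

```python
import re
from collections import Counter


def majority_rating(ratings):
    """Return the label chosen by a strict majority of ratings, else None."""
    label, count = Counter(ratings).most_common(1)[0]
    return label if count > len(ratings) / 2 else None


def count_syllables(word):
    """Crude heuristic (an assumption): count vowel groups and
    discount a silent trailing 'e'. Every word gets at least one."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def flesch_kincaid_grade(text):
    """Flesch-Kincaid Grade Level of an English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


if __name__ == "__main__":
    print(majority_rating(["Excellent", "Excellent", "Marginal"]))
    print(flesch_kincaid_grade("Uveitis is inflammation inside the eye. It can blur vision."))
```

A lower grade level means the text is easier to read, which is the sense in which one model's responses can be "more readable" than another's while being less accurate.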

RESULTS

Claude 3 and ChatGPT 4 demonstrated significantly higher accuracy than Gemini (p < 0.001). Claude 3 also showed the highest proportion of 'Excellent' ratings (96.3%), followed by ChatGPT 4 (88.9%). ChatGPT 3.5, Claude 3, and ChatGPT 4 had no responses rated 'Deficient', whereas 14.8% of Gemini's responses were (p = 0.014). Both ChatGPT 4 (p = 0.008) and Claude 3 (p = 0.042) were more comprehensive than Gemini. Gemini showed significantly better readability than ChatGPT 3.5, Claude 3, and ChatGPT 4 (p < 0.001). Gemini's responses also contained fewer words, letter characters, and sentences than those of ChatGPT 3.5 and Claude 3.
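As a quick arithmetic check, the reported percentages line up with whole-number counts out of the 27 questions. The counts in this sketch are inferred from the percentages and are not stated in the abstract; the helper name `pct` is hypothetical.

```python
N_QUESTIONS = 27  # number of uveitis questions, from the Methods


def pct(count, total=N_QUESTIONS):
    """Percentage of responses, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Inferred counts (an assumption): chosen because they reproduce
# the percentages reported in the Results.
inferred = {
    "Claude 3, 'Excellent'": 26,
    "ChatGPT 4, 'Excellent'": 24,
    "Gemini, 'Deficient'": 4,
}

for label, count in inferred.items():
    print(f"{label}: {count}/{N_QUESTIONS} = {pct(count)}%")
```

With only 27 items per model, a single response shifts a proportion by about 3.7 percentage points, which is worth keeping in mind when comparing the models.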

CONCLUSIONS

Our study highlights the outstanding performance of Claude 3 and ChatGPT 4 in providing precise and thorough information regarding uveitis, surpassing Gemini. ChatGPT 4 and Claude 3 emerge as pivotal tools in improving patient understanding and involvement in their uveitis healthcare journey.
