Assessing the possibility of using large language models in ocular surface diseases.

Authors

Ling Qian, Xu Zi-Song, Zeng Yan-Mei, Hong Qi, Qian Xian-Zhe, Hu Jin-Yu, Pei Chong-Gang, Wei Hong, Zou Jie, Chen Cheng, Wang Xiao-Yu, Chen Xu, Wu Zhen-Kai, Shao Yi

Affiliations

Department of Ophthalmology, the First Affiliated Hospital, Jiangxi Medical College, Nanchang University, Nanchang 330006, Jiangxi Province, China.

Ophthalmology Centre of Maastricht University, Maastricht 6200MS, Limburg, Netherlands.

Publication information

Int J Ophthalmol. 2025 Jan 18;18(1):1-8. doi: 10.18240/ijo.2025.01.01. eCollection 2025.

DOI:10.18240/ijo.2025.01.01

PMID:39829624

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11672086/

Abstract

AIM

To assess the feasibility of using large language models (LLMs) in ocular surface diseases by testing the accuracy of five LLMs in answering specialized questions on these diseases: ChatGPT-4, ChatGPT-3.5, Claude 2, PaLM2, and SenseNova.

METHODS

A group of experienced ophthalmology professors developed a 100-question single-choice exam on ocular surface diseases, designed to assess the performance of LLMs and human participants in answering ophthalmology specialty exam questions. The exam covered the following topics: keratitis (20 questions); keratoconus, keratomalacia, corneal dystrophy, corneal degeneration, erosive corneal ulcers, and corneal lesions associated with systemic diseases (20 questions); conjunctivitis (20 questions); trachoma, pterygium, and conjunctival tumors (20 questions); and dry eye disease (20 questions). The total score of each LLM was then calculated, and the models' mean score, mean correlation, variance, and confidence were compared.
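The scoring step described above can be illustrated with a minimal sketch. It assumes that the 100 questions are ordered by topic in five consecutive blocks of 20 and that each question is worth one point. Both assumptions, along with the topic labels and the function names `score_exam` and `summarize`, are hypothetical and do not come from the study. The abstract does not define how "mean correlation" or "confidence" were computed, so the sketch covers only total score, mean, and variance.

```python
from statistics import mean, pvariance

# Hypothetical topic labels for the five 20-question blocks (assumed ordering).
TOPICS = [
    "keratitis",
    "other_corneal_diseases",
    "conjunctivitis",
    "trachoma_pterygium_conjunctival_tumors",
    "dry_eye",
]
QUESTIONS_PER_TOPIC = 20


def score_exam(answers, answer_key):
    """Return a list of 1 (correct) or 0 (incorrect), one entry per question."""
    if len(answers) != len(answer_key):
        raise ValueError("answers and answer_key must have the same length")
    return [int(a == k) for a, k in zip(answers, answer_key)]


def summarize(correct):
    """Summarize per-question correctness into total, per-topic, mean, variance."""
    expected = len(TOPICS) * QUESTIONS_PER_TOPIC
    if len(correct) != expected:
        raise ValueError(f"expected {expected} questions, got {len(correct)}")
    per_topic = {}
    for i, topic in enumerate(TOPICS):
        start = i * QUESTIONS_PER_TOPIC
        per_topic[topic] = sum(correct[start:start + QUESTIONS_PER_TOPIC])
    return {
        "total": sum(correct),            # total score out of 100
        "per_topic": per_topic,           # score out of 20 for each topic
        "mean": mean(correct),            # fraction of questions answered correctly
        "variance": pvariance(correct),   # population variance of the 0/1 outcomes
    }


if __name__ == "__main__":
    # Made-up data for illustration only, not results from the study.
    key = ["A"] * 100
    answers = ["A"] * 50 + ["B"] * 50
    print(summarize(score_exam(answers, key)))
```

Running `summarize` once per model or per human participant yields the totals and per-topic scores that a group comparison like the one in this study would start from.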

RESULTS

GPT-4 performed best among the LLMs. When the average scores of the LLMs were compared with those of four human groups (chief physicians, attending physicians, regular trainees, and graduate students), every LLM except ChatGPT-4 scored lower in total than the graduate student group, which was the lowest-scoring human group. Both ChatGPT-4 and PaLM2 were more likely to give exact and correct answers, with very little chance of an incorrect answer. ChatGPT-4 showed higher credibility when answering questions, with a success rate of 59%, but gave the wrong answer 28% of the time.

CONCLUSION

The GPT-4 model exhibits excellent performance in both answer relevance and confidence. PaLM2 shows a positive correlation (up to 0.8) in terms of answer accuracy during the exam. In answer confidence, PaLM2 is second only to GPT-4 and surpasses Claude 2, SenseNova, and GPT-3.5. Although ocular surface disease is a highly specialized discipline, GPT-4 still exhibits superior performance, suggesting considerable potential for application in this field and, perhaps, as a valuable resource for medical students and clinicians in the future.
