Assessing the possibility of using large language models in ocular surface diseases.

Authors

Ling Qian, Xu Zi-Song, Zeng Yan-Mei, Hong Qi, Qian Xian-Zhe, Hu Jin-Yu, Pei Chong-Gang, Wei Hong, Zou Jie, Chen Cheng, Wang Xiao-Yu, Chen Xu, Wu Zhen-Kai, Shao Yi

Affiliations

Department of Ophthalmology, the First Affiliated Hospital, Jiangxi Medical College, Nanchang University, Nanchang 330006, Jiangxi Province, China.

Ophthalmology Centre of Maastricht University, Maastricht 6200MS, Limburg, Netherlands.

Publication information

Int J Ophthalmol. 2025 Jan 18;18(1):1-8. doi: 10.18240/ijo.2025.01.01. eCollection 2025.

Abstract

AIM

To assess the feasibility of using large language models (LLMs) in ocular surface diseases by testing the accuracy of five LLMs (ChatGPT-4, ChatGPT-3.5, Claude 2, PaLM2, and SenseNova) in answering specialized questions on ocular surface diseases.

METHODS

A group of experienced ophthalmology professors developed a 100-question single-choice examination on ocular surface diseases, designed to assess the performance of LLMs and human participants in answering ophthalmology specialty exam questions. The exam covered five topic areas of 20 questions each: keratitis; keratoconus, keratomalacia, corneal dystrophy, corneal degeneration, erosive corneal ulcers, and corneal lesions associated with systemic diseases; conjunctivitis; trachoma, pterygium, and conjunctival tumors; and dry eye disease. The total score of each LLM was then calculated, and the models' mean scores, mean correlations, variances, and confidence were compared.
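The scoring step described above can be sketched as follows. This is a minimal illustration, not the authors' analysis: the function names, section labels, and answer data are hypothetical, and it covers only the total score, the per-section mean, and the variance. The abstract does not define how "mean correlation" or "confidence" were computed, so they are left out.

```python
from statistics import mean, pvariance

# Hypothetical labels for the five 20-question blocks named in the methods.
SECTIONS = [
    "keratitis",
    "other corneal diseases",
    "conjunctivitis",
    "trachoma, pterygium and conjunctival tumors",
    "dry eye disease",
]
QUESTIONS_PER_SECTION = 20


def score_exam(answers, key):
    """Return (total_score, per_section_scores) for one respondent."""
    expected = len(SECTIONS) * QUESTIONS_PER_SECTION
    if len(answers) != expected or len(key) != expected:
        raise ValueError(f"expected {expected} answers and key entries")
    # One point per question whose answer matches the key.
    correct = [int(a == k) for a, k in zip(answers, key)]
    per_section = {}
    for i, name in enumerate(SECTIONS):
        start = i * QUESTIONS_PER_SECTION
        per_section[name] = sum(correct[start:start + QUESTIONS_PER_SECTION])
    return sum(correct), per_section


def summarize(per_section):
    """Return (mean, population variance) of the per-section scores."""
    scores = list(per_section.values())
    return mean(scores), pvariance(scores)


# Made-up data: a key of all "A" and a respondent who misses the first 5 questions.
key = ["A"] * 100
answers = ["B"] * 5 + ["A"] * 95
total, per_section = score_exam(answers, key)
section_mean, section_variance = summarize(per_section)
print(total, section_mean, section_variance)
```

The same two functions could be applied to each LLM and each human group in turn, which would give comparable totals and per-section spreads under these assumptions.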

RESULTS

Among the LLMs, ChatGPT-4 achieved the highest performance. When the average scores of the LLM group were compared with those of four human groups (chief physicians, attending physicians, regular trainees, and graduate students), every LLM except ChatGPT-4 had a total score lower than that of the graduate student group, the lowest-scoring human group. Both ChatGPT-4 and PaLM2 were more likely to give exact and correct answers, with very little chance of an incorrect answer. ChatGPT-4 showed higher credibility when answering questions, with a success rate of 59%, but gave the wrong answer 28% of the time.

CONCLUSION

The GPT-4 model exhibits excellent performance in both answer relevance and confidence. PaLM2 shows a positive correlation (up to 0.8) in terms of answer accuracy during the exam. In terms of answer confidence, PaLM2 is second only to GPT-4 and surpasses Claude 2, SenseNova, and GPT-3.5. Although ocular surface disease is a highly specialized discipline, GPT-4 still exhibits superior performance, suggesting that it has considerable potential for application in this field and may become a valuable resource for medical students and clinicians in the future.
