Evaluating multiple large language models on orbital diseases.

Author Information

Yang Qi-Chen, Zeng Yan-Mei, Wei Hong, Chen Cheng, Ling Qian, Wang Xiao-Yu, Chen Xu, Shao Yi

Affiliations

Shanghai General Hospital, National Clinical Research Center for Eye Diseases, Shanghai Key Clinical Specialty, Shanghai Key Laboratory of Ocular Fundus Diseases, Shanghai Engineering Center for Visual Science and Photomedicine, Shanghai Engineering Center for Precise Diagnosis and Treatment of Eye Diseases, National Clinical Key Specialty Construction Project, Eye & ENT Hospital of Fudan University, Shanghai, China.

Department of Ophthalmology, The West China Hospital of Sichuan University, Chengdu, Sichuan, China.

Publication Information

Front Cell Dev Biol. 2025 Jul 7;13:1574378. doi: 10.3389/fcell.2025.1574378. eCollection 2025.

Abstract

Humans avoid mistakes through continuous learning, error correction, and the accumulation of experience, a process that is both time-consuming and laborious and often involves numerous detours. To assist human learning, ChatGPT (Generative Pre-trained Transformer) and related large language models (LLMs) have been developed to generate human-like responses to a wide range of problems. In this study, we assessed the potential of LLMs as assistants for answering questions related to orbital diseases. We compiled a dataset of 100 orbital-disease questions, with their corresponding answers, drawn from examinations administered to ophthalmology residents and medical students. Five LLMs were tested and compared: GPT-4, GPT-3.5, PaLM2, Claude 2, and SenseNova. The best-performing LLM was then compared against ophthalmologists and medical students. GPT-4 and PaLM2 showed a higher average correlation than the other LLMs. GPT-4 also answered accurately across a broader range of topics, attained the highest average score among the LLMs, and reported the highest confidence during the test. GPT-4 outperformed medical students, although it fell short of ophthalmologists. Overall, these findings indicate that GPT-4 performed best in the orbital subdomain of ophthalmology. With further refinement through training, LLMs have considerable potential to serve as comprehensive tools for medical students and ophthalmologists.
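The abstract describes scoring each model on 100 exam questions against an answer key and comparing accuracies. A minimal sketch of that kind of evaluation is below; the `ask` callable, the toy questions, and the answer key are illustrative placeholders, not the study's actual data or code.

```python
# Hypothetical sketch of the scoring procedure described in the abstract:
# a model answers each question, and accuracy is the fraction of answers
# matching the answer key.
from typing import Callable, List


def score_model(ask: Callable[[str], str],
                questions: List[str],
                key: List[str]) -> float:
    """Return the fraction of questions the model answers correctly."""
    correct = sum(
        1 for q, a in zip(questions, key)
        if ask(q).strip().upper() == a.upper()
    )
    return correct / len(questions)


# Toy example with a stub "model" that always answers "A":
questions = ["Q1", "Q2", "Q3", "Q4"]
key = ["A", "B", "A", "C"]
always_a = lambda q: "A"
print(score_model(always_a, questions, key))  # → 0.5
```

In the study's setup, `ask` would wrap a call to each LLM (GPT-4, GPT-3.5, PaLM2, Claude 2, or SenseNova), and the resulting per-model accuracies would be compared with the scores of ophthalmologists and medical students on the same questions.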

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f289/12277337/6ee73490058c/fcell-13-1574378-g001.jpg
