Antaki Fares, Touma Samir, Milad Daniel, El-Khoury Jonathan, Duval Renaud
Department of Ophthalmology, Université de Montréal, Montréal, Quebec, Canada.
Centre Universitaire d'Ophtalmologie (CUO), Hôpital Maisonneuve-Rosemont, CIUSSS de l'Est-de-l'Île-de-Montréal, Montréal, Quebec, Canada.
Ophthalmol Sci. 2023 May 5;3(4):100324. doi: 10.1016/j.xops.2023.100324. eCollection 2023 Dec.
PURPOSE: Foundation models are a novel type of artificial intelligence algorithm, in which models are pretrained at scale on unannotated data and fine-tuned for a myriad of downstream tasks, such as generating text. This study assessed the accuracy of ChatGPT, a large language model (LLM), in the ophthalmology question-answering space.
DESIGN: Evaluation of diagnostic test or technology.
SUBJECTS: ChatGPT is a publicly available LLM.
METHODS: We tested 2 versions of ChatGPT (January 9 "legacy" and ChatGPT Plus) on 2 popular multiple-choice question banks commonly used to prepare for the high-stakes Ophthalmic Knowledge Assessment Program (OKAP) examination. We generated two 260-question simulated exams from the Basic and Clinical Science Course (BCSC) Self-Assessment Program and the OphthoQuestions online question bank. We carried out logistic regression to determine the effect of the examination section, cognitive level, and difficulty index on answer accuracy. We also performed a post hoc analysis using Tukey's test to determine whether there were meaningful differences between the tested subspecialties.
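To make the analysis concrete, below is a minimal sketch in Python (statsmodels) of a logistic regression with a likelihood-ratio chi-square test and a Tukey post hoc comparison. The data, column names, and model formula are illustrative assumptions, not the authors' actual code or dataset.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from scipy import stats
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    # Synthetic stand-in for one graded simulated exam: one row per question.
    rng = np.random.default_rng(0)
    n = 260  # size of one simulated OKAP exam
    df = pd.DataFrame({
        "section": rng.choice(["retina", "glaucoma", "neuro", "pathology"], n),
        "cognitive_level": rng.choice(["recall", "application"], n),
        "difficulty": rng.uniform(0.2, 0.9, n),  # item difficulty index
    })
    # Easier questions (higher difficulty index) are answered correctly
    # more often in this synthetic outcome.
    df["correct"] = rng.binomial(1, 0.2 + 0.6 * df["difficulty"])

    # Logistic regression of answer accuracy on examination section,
    # cognitive level, and difficulty index.
    full = smf.logit("correct ~ C(section) + C(cognitive_level) + difficulty",
                     data=df).fit(disp=0)

    # Likelihood-ratio (LR) chi-square for one predictor: refit without it
    # and compare log-likelihoods of the full and reduced models.
    reduced = smf.logit("correct ~ C(cognitive_level) + difficulty",
                        data=df).fit(disp=0)
    lr = 2 * (full.llf - reduced.llf)
    p = stats.chi2.sf(lr, full.df_model - reduced.df_model)
    print(f"section: LR = {lr:.2f}, P = {p:.3f}")

    # Post hoc pairwise comparisons between sections (Tukey's test on
    # per-question correctness; the paper's exact procedure may differ).
    print(pairwise_tukeyhsd(df["correct"], df["section"]).summary())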
MAIN OUTCOME MEASURES: We reported the accuracy of ChatGPT for each examination section in percentage correct by comparing ChatGPT's outputs with the answer key provided by the question banks. We presented logistic regression results with a likelihood ratio (LR) chi-square. We considered differences between examination sections statistically significant at a P value of < 0.05.
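Continuing the sketch above, percentage correct per examination section reduces to a grouped mean over the graded answers (again an illustrative assumption, not the authors' code):

    # Percentage correct per examination section.
    per_section = df.groupby("section")["correct"].mean().mul(100).round(1)
    print(per_section)

    # Overall accuracy against the answer key.
    print(f"overall: {df['correct'].mean() * 100:.1f}%")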
RESULTS: The legacy model achieved 55.8% accuracy on the BCSC set and 42.7% on the OphthoQuestions set. With ChatGPT Plus, accuracy increased to 59.4% ± 0.6% and 49.2% ± 1.0%, respectively. Accuracy improved with easier questions when controlling for the examination section and cognitive level. Logistic regression analysis of the legacy model showed that the examination section (LR, 27.57; P = 0.006) followed by question difficulty (LR, 24.05; P < 0.001) were most predictive of ChatGPT's answer accuracy. Although the legacy model performed best in general medicine and worst in neuro-ophthalmology (P < 0.001) and ocular pathology (P = 0.029), similar post hoc findings were not seen with ChatGPT Plus, suggesting more consistent results across examination sections.
CONCLUSIONS: ChatGPT showed encouraging performance on a simulated OKAP examination. Specializing LLMs through domain-specific pretraining may be necessary to improve their performance in ophthalmic subspecialties.
FINANCIAL DISCLOSURE(S): Proprietary or commercial disclosure may be found after the references.