Thirunavukarasu Arun James, Mahmood Shathar, Malem Andrew, Foster William Paul, Sanghera Rohan, Hassan Refaat, Zhou Sean, Wong Shiao Wei, Wong Yee Ling, Chong Yu Jeat, Shakeel Abdullah, Chang Yin-Hsi, Tan Benjamin Kye Jyn, Jain Nikhil, Tan Ting Fang, Rauz Saaeha, Ting Daniel Shu Wei, Ting Darren Shu Jeng
University of Cambridge School of Clinical Medicine, Cambridge, United Kingdom.
Oxford University Clinical Academic Graduate School, University of Oxford, Oxford, United Kingdom.
PLOS Digit Health. 2024 Apr 17;3(4):e0000341. doi: 10.1371/journal.pdig.0000341. eCollection 2024 Apr.
Large language models (LLMs) underlie remarkable recent advances in natural language processing, and they are beginning to be applied in clinical contexts. We aimed to evaluate the clinical potential of state-of-the-art LLMs in ophthalmology using a more robust benchmark than raw examination scores. We trialled GPT-3.5 and GPT-4 on 347 ophthalmology questions before GPT-3.5, GPT-4, PaLM 2, LLaMA, expert ophthalmologists, and doctors in training were trialled on a mock examination of 87 questions. Performance was analysed with respect to question subject and type (first order recall and higher order reasoning). Masked ophthalmologists graded the accuracy, relevance, and overall preference of GPT-3.5 and GPT-4 responses to the same questions. The performance of GPT-4 (69%) was superior to GPT-3.5 (48%), LLaMA (32%), and PaLM 2 (56%). GPT-4 compared favourably with expert ophthalmologists (median 76%, range 64-90%), ophthalmology trainees (median 59%, range 57-63%), and unspecialised junior doctors (median 43%, range 41-44%). Low agreement between LLMs and doctors reflected idiosyncratic differences in knowledge and reasoning, with overall consistency across subjects and types (p>0.05). All ophthalmologists preferred GPT-4 responses over GPT-3.5 and rated the accuracy and relevance of GPT-4 as higher (p<0.05). LLMs are approaching expert-level knowledge and reasoning skills in ophthalmology. In view of their comparable or superior performance relative to trainee-grade ophthalmologists and unspecialised junior doctors, state-of-the-art LLMs such as GPT-4 may provide useful medical advice and assistance where access to expert ophthalmologists is limited. Clinical benchmarks provide useful assays of LLM capabilities in healthcare before clinical trials can be designed and conducted.