Shanmugam Sujeeth Krishna, Browning David J
Department of Ophthalmology, Wake Forest University School of Medicine, Winston-Salem, NC, USA.
Clin Ophthalmol. 2024 Nov 12;18:3239-3247. doi: 10.2147/OPTH.S488232. eCollection 2024.
To compare large language models (LLMs) in analyzing and responding to a series of difficult ophthalmic cases.
A comparative case series in which LLMs meeting inclusion criteria were tested on twenty difficult case studies posed in open-text format.
Fifteen LLMs accessible to ophthalmologists were tested against twenty case studies published in JAMA Ophthalmology. Each case was presented to every LLM as identical, open-ended text, and open-ended responses regarding the differential diagnosis, next diagnostic tests, and recommended treatments were requested. Responses were recorded and assessed for accuracy against the published correct answers. The main outcome was the accuracy of LLM responses against the correct answers. Secondary outcomes included comparative performance on the differential diagnosis, ancillary testing, and treatment subtests, as well as the readability of responses.
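A minimal sketch of this kind of evaluation loop is shown below, assuming a prompt-and-grade workflow; the query_model and score_response helpers, the rubric, and the data layout are illustrative placeholders, not the authors' actual tooling.

```python
# Illustrative sketch only: query_model, score_response, and the case/answer-key
# structure are hypothetical placeholders, not the study's code.

def query_model(model_name: str, case_prompt: str) -> str:
    """Placeholder: submit the identical open-ended case text to one LLM
    (via its chat interface or API) and return its free-text reply."""
    raise NotImplementedError

def score_response(reply: str, answer_key: dict) -> int:
    """Placeholder: grade a reply against the published answer on the three
    subtests (differential diagnosis, next diagnostic test, treatment)."""
    raise NotImplementedError

def run_comparison(models: list[str], cases: list[dict]) -> dict[str, int]:
    """Present every case to every model with identical wording and sum accuracy scores."""
    totals = {m: 0 for m in models}
    for case in cases:
        for m in models:
            reply = query_model(m, case["prompt"])
            totals[m] += score_response(reply, case["answer_key"])
    return totals
```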
Scores were normally distributed and ranged from 0 to 35 (out of a maximum score of 60), with a mean ± standard deviation of 19 ± 9. Scores for three of the LLMs (ChatGPT 3.5, Claude Pro, and Copilot Pro) were statistically significantly higher than the mean. Two of the high-performing LLMs required paid subscriptions (Claude Pro and Copilot Pro) and one was free (ChatGPT 3.5). While there were no clinically or statistically significant differences between ChatGPT 3.5 and Claude Pro, there was a separation of 5 points, or 0.56 standard deviations, between Copilot Pro and the other highly ranked LLMs. The readability of all tested programs was above the eighth-grade reading level recommended by the American Medical Association (AMA) for materials intended for public consumers.
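As a quick check of the reported effect size, a 5-point gap over a standard deviation of 9 corresponds to roughly 0.56 SD. The snippet below reproduces that arithmetic and shows the Flesch-Kincaid grade-level formula, one common readability index; the abstract does not state which index the authors actually used, so the formula is illustrative only.

```python
# Reported figures: SD of total scores = 9; gap between Copilot Pro and the
# next-ranked models = 5 points, so the separation is 5 / 9, about 0.56 SD.
sd_of_scores = 9
point_gap = 5
print(round(point_gap / sd_of_scores, 2))  # 0.56

# Flesch-Kincaid grade level, a common readability index (illustrative choice;
# the abstract does not specify which readability metric was applied).
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
```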
Subscription LLMs were more prevalent among the highly ranked LLMs, suggesting that these perform better as ophthalmic assistants. While readability was poor for the average person, the content was understandable to a board-certified ophthalmologist. The accuracy of LLMs is not high enough to recommend their standalone use in patient care, but their use in aiding clinicians with patient care and preventing oversights is promising.