Khan Shaheryar Ahmed, Gunasekera Chrishan
Ophthalmology Department, Moorfields Eye Hospital, London, UK.
Ophthalmology Department, Norfolk & Norwich University Hospital, Norwich, UK.
Eye (Lond). 2025 May;39(7):1301-1308. doi: 10.1038/s41433-025-03605-8. Epub 2025 Jan 21.
This study presents a comprehensive evaluation of the performance of various large language models (LLMs) in generating responses to ophthalmology emergencies and compares their accuracy with the United Kingdom's established National Health Service (NHS) 111 online system.
We included 21 ophthalmology-related emergency scenario questions from the NHS 111 triaging algorithm. These questions were based on four different ophthalmology emergency themes as laid out in the NHS 111 algorithm. Responses generated by NHS 111 online were compared with the responses of different LLM chatbots to determine the accuracy of the LLM responses. We included a range of models: ChatGPT-3.5, Google Bard, Bing Chat, and ChatGPT-4.0. The accuracy of each LLM chatbot response was compared against the NHS 111 triage using a two-prompt strategy. Answers were graded as follows: -2 "Very poor", -1 "Poor", 0 "No response", 1 "Good", 2 "Very good", and 3 "Excellent".
Overall, the LLMs attained good accuracy in this study when compared against the NHS 111 responses. A score of ≥1, graded as "Good", was achieved by 93% of all LLM responses, meaning that at least part of the answer contained correct information and none of it contained wrong information. Results were very similar across both prompts, with no marked difference between them.
The high accuracy and safety observed in LLM responses support their potential as effective tools for providing timely information and guidance to patients. LLMs hold promise for enhancing patient care and healthcare accessibility in the digital age.