Tailor Prashant D, Dalvin Lauren A, Chen John J, Iezzi Raymond, Olsen Timothy W, Scruggs Brittni A, Barkmeier Andrew J, Bakri Sophie J, Ryan Edwin H, Tang Peter H, Parke D Wilkin, Belin Peter J, Sridhar Jayanth, Xu David, Kuriyan Ajay E, Yonekawa Yoshihiro, Starr Matthew R
Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota.
Retina Consultants of Minnesota, Edina, Minnesota.
Ophthalmol Sci. 2024 Feb 6;4(4):100485. doi: 10.1016/j.xops.2024.100485. eCollection 2024 Jul-Aug.
To assess the quality, empathy, and safety of expert-edited large language model (LLM) responses, human expert-created responses, and LLM-generated responses to common retina patient questions.
Randomized, masked, multicenter study.
Twenty-one common retina patient questions were randomly assigned among 13 retina specialists.
Each expert created a response (Expert) and then edited an LLM (ChatGPT-4)-generated response to the same question (Expert + artificial intelligence [AI]), timing themselves for both tasks. Five LLMs (ChatGPT-3.5, ChatGPT-4, Claude 2, Bing, and Bard) also generated responses to each question. The original question, along with anonymized and randomized Expert + AI, Expert, and LLM responses, was evaluated by the other experts who had not written an expert response to that question. Evaluators judged quality and empathy (very poor, poor, acceptable, good, or very good) along with safety metrics (incorrect information, likelihood to cause harm, extent of harm, and missing content).
Mean quality and empathy scores, and the proportion of responses with incorrect information, likelihood to cause harm, extent of harm, and missing content for each response type.
There were 4008 total grades collected (2608 for quality and empathy; 1400 for safety metrics), with significant differences in both quality and empathy (P < 0.001, P < 0.001) among the LLM, Expert, and Expert + AI groups. For quality, Expert + AI (3.86 ± 0.85) performed the best overall, while GPT-3.5 (3.75 ± 0.79) was the top-performing LLM. For empathy, GPT-3.5 (3.75 ± 0.69) had the highest mean score, followed by Expert + AI (3.73 ± 0.63). By mean score, Expert placed 4th of 7 for quality and 6th of 7 for empathy. For both quality (P < 0.001) and empathy (P < 0.001), expert-edited LLM responses performed better than expert-created responses. There were time savings for an expert-edited LLM response versus an expert-created response (P = 0.02). ChatGPT-4 performed similarly to Expert for inappropriate content (P = 0.35), missing content (P = 0.001), extent of possible harm (P = 0.356), and likelihood of possible harm (P = 0.129).
In this randomized, masked, multicenter study, LLM responses were comparable with expert-created responses in terms of quality, empathy, and safety metrics, warranting further exploration of their potential benefits in clinical settings.
Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of the article.