A Comparative Study of Large Language Models, Human Experts, and Expert-Edited Large Language Models to Neuro-Ophthalmology Questions.

Authors

Tailor Prashant D, Dalvin Lauren A, Starr Matthew R, Tajfirouz Deena A, Chodnicki Kevin D, Brodsky Michael C, Mansukhani Sasha A, Moss Heather E, Lai Kevin E, Ko Melissa W, Mackay Devin D, Di Nome Marie A, Dumitrascu Oana M, Pless Misha L, Eggenberger Eric R, Chen John J

Affiliations

Department of Ophthalmology (PDT, LAD, MRS, DAT, KDC, MCB, SAM, JJC), Mayo Clinic, Rochester, Minnesota; Departments of Ophthalmology (HEM) and Neurology & Neurological Sciences (HEM), Stanford University, Palo Alto, California; Department of Ophthalmology (KEL, MWK, DDM), Glick Eye Institute, Indiana University School of Medicine, Indianapolis, Indiana; Ophthalmology Service (KEL), Richard L. Roudebush Veterans' Administration Medical Center, Indianapolis, Indiana; Department of Ophthalmology and Visual Sciences (KEL), University of Louisville, Louisville, Kentucky; Midwest Eye Institute (KEL), Carmel, Indiana; Circle City Neuro-Ophthalmology (KEL), Carmel, Indiana; Department of Neurology (MWK, DDM), Indiana University, Indianapolis, Indiana; Department of Ophthalmology (MADN, OMD), Mayo Clinic, Scottsdale, Arizona; and Department of Ophthalmology (MLP, ERE), Mayo Clinic, Jacksonville, Florida.

Publication Information

J Neuroophthalmol. 2025 Mar 1;45(1):71-77. doi: 10.1097/WNO.0000000000002145. Epub 2024 Apr 2.

Abstract

BACKGROUND

While large language models (LLMs) are increasingly used in medicine, their effectiveness compared with human experts remains unclear. This study evaluates the quality and empathy of Expert + AI, human experts, and LLM responses in neuro-ophthalmology.

METHODS

This randomized, masked, multicenter cross-sectional study was conducted from June to July 2023. We randomly assigned 21 neuro-ophthalmology questions to 13 experts. Each expert provided an answer and then edited a ChatGPT-4-generated response, timing both tasks. In addition, 5 LLMs (ChatGPT-3.5, ChatGPT-4, Claude 2, Bing, Bard) generated responses. Anonymized and randomized responses from Expert + AI, human experts, and LLMs were evaluated by the remaining 12 experts. The main outcome was the mean score for quality and empathy, rated on a 1-5 scale.

RESULTS

Significant differences existed between response types for both quality and empathy (P < 0.0001 for both). For quality, Expert + AI (4.16 ± 0.81) performed the best, followed by GPT-4 (4.04 ± 0.92), GPT-3.5 (3.99 ± 0.87), Claude (3.6 ± 1.09), Expert (3.56 ± 1.01), Bard (3.5 ± 1.15), and Bing (3.04 ± 1.12). For empathy, Expert + AI (3.63 ± 0.87) had the highest score, followed by GPT-4 (3.6 ± 0.88), Bard (3.54 ± 0.89), GPT-3.5 (3.5 ± 0.83), Bing (3.27 ± 1.03), Expert (3.26 ± 1.08), and Claude (3.11 ± 0.78). For quality (P < 0.0001) and empathy (P = 0.002), Expert + AI performed better than Expert. Time taken for expert-created and expert-edited LLM responses was similar (P = 0.75).

CONCLUSIONS

Expert-edited LLM responses had the highest expert-determined ratings of quality and empathy, warranting further exploration of their potential benefits in clinical settings.


