A Comparative Study of Large Language Models, Human Experts, and Expert-Edited Large Language Models to Neuro-Ophthalmology Questions.

Authors

Tailor Prashant D, Dalvin Lauren A, Starr Matthew R, Tajfirouz Deena A, Chodnicki Kevin D, Brodsky Michael C, Mansukhani Sasha A, Moss Heather E, Lai Kevin E, Ko Melissa W, Mackay Devin D, Di Nome Marie A, Dumitrascu Oana M, Pless Misha L, Eggenberger Eric R, Chen John J

Affiliations

Department of Ophthalmology (PDT, LAD, MRS, DAT, KDC, MCB, SAM, JJC), Mayo Clinic, Rochester, Minnesota; Departments of Ophthalmology (HEM) and Neurology & Neurological Sciences (HEM), Stanford University, Palo Alto, California; Department of Ophthalmology (KEL, MWK, DDM), Glick Eye Institute, Indiana University School of Medicine, Indianapolis, Indiana; Ophthalmology Service (KEL), Richard L. Roudebush Veterans' Administration Medical Center, Indianapolis, Indiana; Department of Ophthalmology and Visual Sciences (KEL), University of Louisville, Louisville, Kentucky; Midwest Eye Institute (KEL), Carmel, Indiana; Circle City Neuro-Ophthalmology (KEL), Carmel, Indiana; Department of Neurology (MWK, DDM), Indiana University, Indianapolis, Indiana; Department of Ophthalmology (MADN, OMD), Mayo Clinic, Scottsdale, Arizona; and Department of Ophthalmology (MLP, ERE), Mayo Clinic, Jacksonville, Florida.

Publication

J Neuroophthalmol. 2025 Mar 1;45(1):71-77. doi: 10.1097/WNO.0000000000002145. Epub 2024 Apr 2.

DOI: 10.1097/WNO.0000000000002145
PMID: 38564282
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11445389/
Abstract

BACKGROUND

While large language models (LLMs) are increasingly used in medicine, their effectiveness compared with human experts remains unclear. This study evaluates the quality and empathy of Expert + AI, human experts, and LLM responses in neuro-ophthalmology.

METHODS

This randomized, masked, multicenter cross-sectional study was conducted from June to July 2023. We randomly assigned 21 neuro-ophthalmology questions to 13 experts. Each expert provided an answer and then edited a ChatGPT-4-generated response, timing both tasks. In addition, 5 LLMs (ChatGPT-3.5, ChatGPT-4, Claude 2, Bing, Bard) generated responses. Anonymized and randomized responses from Expert + AI, human experts, and LLMs were evaluated by the remaining 12 experts. The main outcome was the mean score for quality and empathy, rated on a 1-5 scale.

RESULTS

Significant differences existed between response types for both quality and empathy (P < 0.0001, P < 0.0001). For quality, Expert + AI (4.16 ± 0.81) performed the best, followed by GPT-4 (4.04 ± 0.92), GPT-3.5 (3.99 ± 0.87), Claude (3.6 ± 1.09), Expert (3.56 ± 1.01), Bard (3.5 ± 1.15), and Bing (3.04 ± 1.12). For empathy, Expert + AI (3.63 ± 0.87) had the highest score, followed by GPT-4 (3.6 ± 0.88), Bard (3.54 ± 0.89), GPT-3.5 (3.5 ± 0.83), Bing (3.27 ± 1.03), Expert (3.26 ± 1.08), and Claude (3.11 ± 0.78). For quality (P < 0.0001) and empathy (P = 0.002), Expert + AI performed better than Expert. Time taken for expert-created and expert-edited LLM responses was similar (P = 0.75).

CONCLUSIONS

Expert-edited LLM responses had the highest expert-determined ratings of quality and empathy, warranting further exploration of their potential benefits in clinical settings.


Similar Articles

1. A Comparative Study of Responses to Retina Questions from Either Experts, Expert-Edited Large Language Models, or Expert-Edited Large Language Models Alone. Ophthalmol Sci. 2024 Feb 6;4(4):100485. doi: 10.1016/j.xops.2024.100485. eCollection 2024 Jul-Aug.
2. Is the information provided by large language models valid in educating patients about adolescent idiopathic scoliosis? An evaluation of content, clarity, and empathy: The perspective of the European Spine Study Group. Spine Deform. 2025 Mar;13(2):361-372. doi: 10.1007/s43390-024-00955-3. Epub 2024 Nov 4.
3. Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions. Cureus. 2024 Mar 11;16(3):e55991. doi: 10.7759/cureus.55991. eCollection 2024 Mar.
4. Can Large Language Models Aid Caregivers of Pediatric Cancer Patients in Information Seeking? A Cross-Sectional Investigation. Cancer Med. 2025 Jan;14(1):e70554. doi: 10.1002/cam4.70554.
5. Advancing health coaching: A comparative study of large language model and health coaches. Artif Intell Med. 2024 Nov;157:103004. doi: 10.1016/j.artmed.2024.103004. Epub 2024 Oct 19.
6. Large Language Models and Empathy: Systematic Review. J Med Internet Res. 2024 Dec 11;26:e52597. doi: 10.2196/52597.
7. Performance of Large Language Models (ChatGPT, Bing Search, and Google Bard) in Solving Case Vignettes in Physiology. Cureus. 2023 Aug 4;15(8):e42972. doi: 10.7759/cureus.42972. eCollection 2023 Aug.
8. Factors Associated With the Accuracy of Large Language Models in Basic Medical Science Examinations: Cross-Sectional Study. JMIR Med Educ. 2025 Jan 13;11:e58898. doi: 10.2196/58898.
9. Evidence-based potential of generative artificial intelligence large language models in orthodontics: a comparative study of ChatGPT, Google Bard, and Microsoft Bing. Eur J Orthod. 2024 Apr 13. doi: 10.1093/ejo/cjae017.

Cited By

1. Evaluation and comparison of large language models' responses to questions related optic neuritis. Front Med (Lausanne). 2025 Jun 25;12:1516442. doi: 10.3389/fmed.2025.1516442. eCollection 2025.
2. Evaluation of Responses to Questions About Keratoconus Using ChatGPT-4.0, Google Gemini and Microsoft Copilot: A Comparative Study of Large Language Models on Keratoconus. Eye Contact Lens. 2025 Mar 1;51(3):e107-e111. doi: 10.1097/ICL.0000000000001158. Epub 2024 Dec 4.
3. Exploring the Role of ChatGPT-4, BingAI, and Gemini as Virtual Consultants to Educate Families about Retinopathy of Prematurity. Children (Basel). 2024 Jun 20;11(6):750. doi: 10.3390/children11060750.
