Suppr 超能文献


A Comparative Study of Responses to Retina Questions from Either Experts, Expert-Edited Large Language Models, or Large Language Models Alone.

Authors

Tailor Prashant D, Dalvin Lauren A, Chen John J, Iezzi Raymond, Olsen Timothy W, Scruggs Brittni A, Barkmeier Andrew J, Bakri Sophie J, Ryan Edwin H, Tang Peter H, Parke D Wilkin, Belin Peter J, Sridhar Jayanth, Xu David, Kuriyan Ajay E, Yonekawa Yoshihiro, Starr Matthew R

Affiliations

Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota.

Retina Consultants of Minnesota, Edina, Minnesota.

Publication

Ophthalmol Sci. 2024 Feb 6;4(4):100485. doi: 10.1016/j.xops.2024.100485. eCollection 2024 Jul-Aug.

DOI: 10.1016/j.xops.2024.100485
PMID: 38660460
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11041826/
Abstract

OBJECTIVE

To assess the quality, empathy, and safety of expert-edited large language model (LLM), human expert-created, and LLM-generated responses to common retina patient questions.

DESIGN

Randomized, masked, multicenter study.

PARTICIPANTS

Twenty-one common retina patient questions were randomly assigned among 13 retina specialists.

METHODS

Each expert created a response (Expert) and then edited an LLM (ChatGPT-4)-generated response to the same question (Expert + artificial intelligence [AI]), timing themselves for both tasks. Five LLMs (ChatGPT-3.5, ChatGPT-4, Claude 2, Bing, and Bard) also generated responses to each question. The original question, along with anonymized and randomized Expert + AI, Expert, and LLM responses, was evaluated by the other experts, who had not written an expert response to that question. Evaluators judged quality and empathy (very poor, poor, acceptable, good, or very good) along with safety metrics (incorrect information, likelihood to cause harm, extent of harm, and missing content).
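The masked grading workflow above can be sketched as follows. This is an illustrative sketch only, not the authors' analysis code: the 1-5 numeric mapping of the Likert labels is an assumption, and the sample responses and helper names are invented.

```python
import random
import statistics

# Assumed 1-5 numeric mapping for the 5-point Likert labels used by evaluators.
LIKERT = {"very poor": 1, "poor": 2, "acceptable": 3, "good": 4, "very good": 5}

def anonymize_and_shuffle(responses, rng):
    """Strip source labels and shuffle, so an evaluator cannot tell whether a
    response came from Expert, Expert + AI, or one of the five LLMs."""
    masked = [text for _source, text in responses.items()]
    rng.shuffle(masked)
    return masked

def mean_likert(grades):
    """Mean and population SD of numeric Likert grades for one response type."""
    values = [LIKERT[g] for g in grades]
    return statistics.mean(values), statistics.pstdev(values)

# Invented example responses (placeholders, not study data).
responses = {
    "Expert": "Your retina specialist will discuss treatment options...",
    "Expert + AI": "Thank you for your question about retinal detachment...",
    "ChatGPT-3.5": "I'm sorry to hear about your symptoms...",
}
masked = anonymize_and_shuffle(responses, random.Random(0))
m, sd = mean_likert(["good", "good", "acceptable", "very good"])
print(round(m, 2))  # mean of 4, 4, 3, 5 -> 4.0
```

Reporting the mean of such grades per response type is how summary figures like "3.86 ± 0.85" in the Results arise.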

MAIN OUTCOME

Mean quality and empathy score, proportion of responses with incorrect information, likelihood to cause harm, extent of harm, and missing content for each response type.

RESULTS

There were 4008 total grades collected (2608 for quality and empathy; 1400 for safety metrics), with significant differences in both quality and empathy (P < 0.001, P < 0.001) between the LLM, Expert, and Expert + AI groups. For quality, Expert + AI (3.86 ± 0.85) performed best overall, while GPT-3.5 (3.75 ± 0.79) was the top-performing LLM. For empathy, GPT-3.5 (3.75 ± 0.69) had the highest mean score, followed by Expert + AI (3.73 ± 0.63). By mean score, Expert placed 4 out of 7 for quality and 6 out of 7 for empathy. For both quality (P < 0.001) and empathy (P < 0.001), expert-edited LLM responses performed better than expert-created responses. There were time savings for an expert-edited LLM response versus an expert-created response (P = 0.02). ChatGPT-4 performed similarly to Expert for inappropriate content (P = 0.35), missing content (P = 0.001), extent of possible harm (P = 0.356), and likelihood of possible harm (P = 0.129).
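The safety metrics above are reported as proportions of flagged responses per group. A minimal sketch of that arithmetic, using made-up flags rather than the study's actual grades:

```python
from statistics import mean

# Illustrative safety flags: True if a response was judged to contain
# incorrect information. The data below are invented for the example.
incorrect_flags = {
    "Expert":    [False, False, True, False],
    "ChatGPT-4": [False, True, False, False],
}

# bool is an int subclass in Python, so the mean of the flags
# is exactly the proportion of flagged responses.
proportions = {group: mean(flags) for group, flags in incorrect_flags.items()}
print(proportions)  # {'Expert': 0.25, 'ChatGPT-4': 0.25}
```

Whether two such proportions differ significantly (the reported P values) would then be tested with a standard two-sample comparison, which is omitted here.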

CONCLUSIONS

In this randomized, masked, multicenter study, LLM responses were comparable with experts in terms of quality, empathy, and safety metrics, warranting further exploration of their potential benefits in clinical settings.

FINANCIAL DISCLOSURES

Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of the article.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0d3a/11041826/b0edf4da46b8/gr2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0d3a/11041826/7556ac0700c1/gr1.jpg

Similar Articles

1
A Comparative Study of Large Language Models, Human Experts, and Expert-Edited Large Language Models to Neuro-Ophthalmology Questions.
J Neuroophthalmol. 2025 Mar 1;45(1):71-77. doi: 10.1097/WNO.0000000000002145. Epub 2024 Apr 2.
2
Comparison of Ophthalmologist and Large Language Model Chatbot Responses to Online Patient Eye Care Questions.
JAMA Netw Open. 2023 Aug 1;6(8):e2330320. doi: 10.1001/jamanetworkopen.2023.30320.
3
Quality of Large Language Model Responses to Radiation Oncology Patient Care Questions.
JAMA Netw Open. 2024 Apr 1;7(4):e244630. doi: 10.1001/jamanetworkopen.2024.4630.
4
Quality of Answers of Generative Large Language Models Versus Peer Users for Interpreting Laboratory Test Results for Lay Patients: Evaluation Study.
J Med Internet Res. 2024 Apr 17;26:e56655. doi: 10.2196/56655.
5
Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study.
ArXiv. 2024 Jan 23:arXiv:2402.01693v1.
6
Harnessing artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in generating clinician-level bariatric surgery recommendations.
Surg Obes Relat Dis. 2024 Jul;20(7):603-608. doi: 10.1016/j.soard.2024.03.011. Epub 2024 Mar 24.
7
Evidence-based potential of generative artificial intelligence large language models in orthodontics: a comparative study of ChatGPT, Google Bard, and Microsoft Bing.
Eur J Orthod. 2024 Apr 13. doi: 10.1093/ejo/cjae017.
8
Large Language Models for Therapy Recommendations Across 3 Clinical Specialties: Comparative Study.
J Med Internet Res. 2023 Oct 30;25:e49324. doi: 10.2196/49324.
9
"Doctor ChatGPT, Can You Help Me?" The Patient's Perspective: Cross-Sectional Study.
J Med Internet Res. 2024 Oct 1;26:e58831. doi: 10.2196/58831.

Cited By

1
Large language models in ophthalmology: a scoping review on their utility for clinicians, researchers, patients, and educators.
Eye (Lond). 2025 Aug 25. doi: 10.1038/s41433-025-03935-7.
2
Evaluating the Effectiveness of Large Language Models in Providing Patient Education for Chinese Patients With Ocular Myasthenia Gravis: Mixed Methods Study.
J Med Internet Res. 2025 Apr 10;27:e67883. doi: 10.2196/67883.
3
Evaluation of AI Summaries on Interdisciplinary Understanding of Ophthalmology Notes.
JAMA Ophthalmol. 2025 May 1;143(5):410-419. doi: 10.1001/jamaophthalmol.2025.0351.
4
Large Language Models in Ophthalmology: A Review of Publications from Top Ophthalmology Journals.
Ophthalmol Sci. 2024 Dec 17;5(3):100681. doi: 10.1016/j.xops.2024.100681. eCollection 2025 May-Jun.
5
Opportunities and Challenges of Chatbots in Ophthalmology: A Narrative Review.
J Pers Med. 2024 Dec 21;14(12):1165. doi: 10.3390/jpm14121165.
6
Developing and Evaluating Large Language Model-Generated Emergency Medicine Handoff Notes.
JAMA Netw Open. 2024 Dec 2;7(12):e2448723. doi: 10.1001/jamanetworkopen.2024.48723.
7
Large language models in patient education: a scoping review of applications in medicine.
Front Med (Lausanne). 2024 Oct 29;11:1477898. doi: 10.3389/fmed.2024.1477898. eCollection 2024.
8
Applications of ChatGPT in the diagnosis, management, education, and research of retinal diseases: a scoping review.
Int J Retina Vitreous. 2024 Oct 17;10(1):79. doi: 10.1186/s40942-024-00595-9.

References

1
Comparison of Ophthalmologist and Large Language Model Chatbot Responses to Online Patient Eye Care Questions.
JAMA Netw Open. 2023 Aug 1;6(8):e2330320. doi: 10.1001/jamanetworkopen.2023.30320.
2
Accuracy of Vitreoretinal Disease Information From an Artificial Intelligence Chatbot.
JAMA Ophthalmol. 2023 Sep 1;141(9):906-907. doi: 10.1001/jamaophthalmol.2023.3314.
3
Experimental evidence on the productivity effects of generative artificial intelligence.
Science. 2023 Jul 14;381(6654):187-192. doi: 10.1126/science.adh2586. Epub 2023 Jul 13.
4
Large language models encode clinical knowledge.
Nature. 2023 Aug;620(7972):172-180. doi: 10.1038/s41586-023-06291-2. Epub 2023 Jul 12.
5
Putting ChatGPT's Medical Advice to the (Turing) Test: Survey Study.
JMIR Med Educ. 2023 Jul 10;9:e46939. doi: 10.2196/46939.
6
Decoding radiology reports: Potential application of OpenAI ChatGPT to enhance patient understanding of diagnostic reports.
Clin Imaging. 2023 Sep;101:137-141. doi: 10.1016/j.clinimag.2023.06.008. Epub 2023 Jun 8.
7
Appropriateness and Readability of ChatGPT-4-Generated Responses for Surgical Treatment of Retinal Diseases.
Ophthalmol Retina. 2023 Oct;7(10):862-868. doi: 10.1016/j.oret.2023.05.022. Epub 2023 Jun 3.
8
Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum.
JAMA Intern Med. 2023 Jun 1;183(6):589-596. doi: 10.1001/jamainternmed.2023.1838.
9
Comparison Between ChatGPT and Google Search as Sources of Postoperative Patient Instructions.
JAMA Otolaryngol Head Neck Surg. 2023 Jun 1;149(6):556-558. doi: 10.1001/jamaoto.2023.0704.
10
Trends in Electronic Health Record Inbox Messaging During the COVID-19 Pandemic in an Ambulatory Practice Network in New England.
JAMA Netw Open. 2021 Oct 1;4(10):e2131490. doi: 10.1001/jamanetworkopen.2021.31490.