
Comparison of Ophthalmologist and Large Language Model Chatbot Responses to Online Patient Eye Care Questions.

Author Affiliations

Department of Ophthalmology, Byers Eye Institute, Stanford University, Stanford, California.

Department of Ophthalmology, Kaiser Permanente San Francisco, San Francisco, California.

Publication Information

JAMA Netw Open. 2023 Aug 1;6(8):e2330320. doi: 10.1001/jamanetworkopen.2023.30320.

DOI: 10.1001/jamanetworkopen.2023.30320
PMID: 37606922
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10445188/
Abstract
Abstract

IMPORTANCE

Large language models (LLMs) like ChatGPT appear capable of performing a variety of tasks, including answering patient eye care questions, but have not yet been evaluated in direct comparison with ophthalmologists. It remains unclear whether LLM-generated advice is accurate, appropriate, and safe for eye patients.

OBJECTIVE

To evaluate the quality of ophthalmology advice generated by an LLM chatbot in comparison with ophthalmologist-written advice.

DESIGN, SETTING, AND PARTICIPANTS

This cross-sectional study used deidentified data from an online medical forum, in which patient questions received responses written by American Academy of Ophthalmology (AAO)-affiliated ophthalmologists. A masked panel of 8 board-certified ophthalmologists was asked to distinguish between answers generated by the ChatGPT chatbot and human answers. Posts were dated between 2007 and 2016; data were accessed in January 2023, and analysis was performed between March and May 2023.

MAIN OUTCOMES AND MEASURES

Identification of chatbot and human answers on a 4-point scale (likely or definitely artificial intelligence [AI] vs likely or definitely human) and evaluation of responses for presence of incorrect information, alignment with perceived consensus in the medical community, likelihood to cause harm, and extent of harm.

RESULTS

A total of 200 pairs of user questions and answers by AAO-affiliated ophthalmologists were evaluated. The mean (SD) accuracy for distinguishing between AI and human responses was 61.3% (9.7%). Of 800 evaluations of chatbot-written answers, 168 answers (21.0%) were marked as human-written, while 517 of 800 human-written answers (64.6%) were marked as AI-written. Compared with human answers, chatbot answers were more frequently rated as probably or definitely written by AI (prevalence ratio [PR], 1.72; 95% CI, 1.52-1.93). The likelihood of chatbot answers containing incorrect or inappropriate material was comparable with human answers (PR, 0.92; 95% CI, 0.77-1.10), and did not differ from human answers in terms of likelihood of harm (PR, 0.84; 95% CI, 0.67-1.07) nor extent of harm (PR, 0.99; 95% CI, 0.80-1.22).
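As an aside on the statistic reported above: a prevalence ratio is simply the ratio of two proportions, with a confidence interval usually built on the log scale. The sketch below is a minimal illustration using the raw counts from this abstract (632 of 800 chatbot answers and 517 of 800 human answers rated likely/definitely AI). `prevalence_ratio` is a hypothetical helper, and this naive 2×2 calculation treats the 800 ratings per group as independent; it therefore does not reproduce the study's reported PR of 1.72, which came from an analysis accounting for the repeated-ratings design.

```python
import math

def prevalence_ratio(a, n1, b, n2, z=1.96):
    """Unadjusted prevalence ratio p1/p2 for two proportions a/n1 and b/n2,
    with a Wald-type 95% CI constructed on the log scale."""
    p1, p2 = a / n1, b / n2
    pr = p1 / p2
    # Standard error of log(PR), assuming independent binomial samples
    se = math.sqrt((1 - p1) / a + (1 - p2) / b)
    lower = pr * math.exp(-z * se)
    upper = pr * math.exp(z * se)
    return pr, lower, upper

# Counts derived from the abstract (800 - 168 = 632 chatbot answers,
# and 517 of 800 human answers, were marked as AI-written).
pr, lo, hi = prevalence_ratio(632, 800, 517, 800)
print(f"Unadjusted PR = {pr:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

The gap between this unadjusted figure and the published 1.72 is expected: each of the 8 graders rated every answer, so the study's regression model had to account for within-grader correlation rather than pooling all 800 ratings as independent observations.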

CONCLUSIONS AND RELEVANCE

In this cross-sectional study of human-written and AI-generated responses to 200 eye care questions from an online advice forum, a chatbot appeared capable of responding to long user-written eye health posts and largely generated appropriate responses that did not differ significantly from ophthalmologist-written responses in terms of incorrect information, likelihood of harm, extent of harm, or deviation from ophthalmologist community standards. Additional research is needed to assess patient attitudes toward LLM-augmented ophthalmologists vs fully autonomous AI content generation, to evaluate clarity and acceptability of LLM-generated answers from the patient perspective, to test the performance of LLMs in a greater variety of clinical contexts, and to determine an optimal manner of utilizing LLMs that is ethical and minimizes harm.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/35f8/10445188/b6ffb440a155/jamanetwopen-e2330320-g001.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/35f8/10445188/67c80ed65624/jamanetwopen-e2330320-g002.jpg

Similar Articles

1. Comparison of Ophthalmologist and Large Language Model Chatbot Responses to Online Patient Eye Care Questions.
JAMA Netw Open. 2023 Aug 1;6(8):e2330320. doi: 10.1001/jamanetworkopen.2023.30320.
2. Dr. Google vs. Dr. ChatGPT: Exploring the Use of Artificial Intelligence in Ophthalmology by Comparing the Accuracy, Safety, and Readability of Responses to Frequently Asked Patient Questions Regarding Cataracts and Cataract Surgery.
Semin Ophthalmol. 2024 Aug;39(6):472-479. doi: 10.1080/08820538.2024.2326058. Epub 2024 Mar 22.
3. Assessment of a Large Language Model's Responses to Questions and Cases About Glaucoma and Retina Management.
JAMA Ophthalmol. 2024 Apr 1;142(4):371-375. doi: 10.1001/jamaophthalmol.2023.6917.
4. Quality of Large Language Model Responses to Radiation Oncology Patient Care Questions.
JAMA Netw Open. 2024 Apr 1;7(4):e244630. doi: 10.1001/jamanetworkopen.2024.4630.
5. A Comparative Study of Responses to Retina Questions from Either Experts, Expert-Edited Large Language Models, or Expert-Edited Large Language Models Alone.
Ophthalmol Sci. 2024 Feb 6;4(4):100485. doi: 10.1016/j.xops.2024.100485. eCollection 2024 Jul-Aug.
6. Quality of Answers of Generative Large Language Models Versus Peer Users for Interpreting Laboratory Test Results for Lay Patients: Evaluation Study.
J Med Internet Res. 2024 Apr 17;26:e56655. doi: 10.2196/56655.
7. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum.
JAMA Intern Med. 2023 Jun 1;183(6):589-596. doi: 10.1001/jamainternmed.2023.1838.
8. "Doctor ChatGPT, Can You Help Me?" The Patient's Perspective: Cross-Sectional Study.
J Med Internet Res. 2024 Oct 1;26:e58831. doi: 10.2196/58831.
9. Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study.
ArXiv. 2024 Jan 23:arXiv:2402.01693v1.
10. Accuracy and Reliability of Chatbot Responses to Physician Questions.
JAMA Netw Open. 2023 Oct 2;6(10):e2336483. doi: 10.1001/jamanetworkopen.2023.36483.

Cited By

1. Prompt Engineering in Clinical Practice: Tutorial for Clinicians.
J Med Internet Res. 2025 Sep 15;27:e72644. doi: 10.2196/72644.
2. Reducing Hallucinations and Trade-Offs in Responses in Generative AI Chatbots for Cancer Information: Development and Evaluation Study.
JMIR Cancer. 2025 Sep 11;11:e70176. doi: 10.2196/70176.
3. Clinical decision-making for uveal melanoma radiotherapy: comparative performance of experienced radiation oncologists and leading generative AI models.
Front Oncol. 2025 Aug 14;15:1605916. doi: 10.3389/fonc.2025.1605916. eCollection 2025.
4. Systematic Review on Large Language Models in Orthopaedic Surgery.
J Clin Med. 2025 Aug 20;14(16):5876. doi: 10.3390/jcm14165876.
5. ChatGPT and human dietitian responses to diet-related questions on an online Q&A platform: A comparative study.
Digit Health. 2025 Aug 21;11:20552076251361381. doi: 10.1177/20552076251361381. eCollection 2025 Jan-Dec.
6. Large language models in ophthalmology: a scoping review on their utility for clinicians, researchers, patients, and educators.
Eye (Lond). 2025 Aug 25. doi: 10.1038/s41433-025-03935-7.
7. Application of artificial intelligence chatbots in interpreting magnetic resonance imaging reports: a comparative study.
Sci Rep. 2025 Aug 25;15(1):31266. doi: 10.1038/s41598-025-17355-w.
8. Evaluating a Chatbot as a Companion for Patients With Breast Cancer: Collaborative Pilot Study.
JMIR Cancer. 2025 Aug 13;11:e68426. doi: 10.2196/68426.
9. Use of a Medical Communication Framework to Assess the Quality of Generative Artificial Intelligence Replies to Primary Care Patient Portal Messages: Content Analysis.
JMIR Form Res. 2025 Jul 31;9:e71966. doi: 10.2196/71966.
10. Benchmarking AI Chatbots for Maternal Lactation Support: A Cross-Platform Evaluation of Quality, Readability, and Clinical Accuracy.
Healthcare (Basel). 2025 Jul 20;13(14):1756. doi: 10.3390/healthcare13141756.

References

1. ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports.
Eur Radiol. 2024 May;34(5):2817-2825. doi: 10.1007/s00330-023-10213-1. Epub 2023 Oct 5.
2. ChatGPT and Ophthalmology: Exploring Its Potential with Discharge Summaries and Operative Notes.
Semin Ophthalmol. 2023 Jul;38(5):503-507. doi: 10.1080/08820538.2023.2209166. Epub 2023 May 3.
3. Artificial intelligence-based ChatGPT chatbot responses for patient and parent questions on vernal keratoconjunctivitis.
Graefes Arch Clin Exp Ophthalmol. 2023 Oct;261(10):3041-3043. doi: 10.1007/s00417-023-06078-1. Epub 2023 May 2.
4. Enhancing Expert Panel Discussions in Pediatric Palliative Care: Innovative Scenario Development and Summarization With ChatGPT-4.
Cureus. 2023 Apr 28;15(4):e38249. doi: 10.7759/cureus.38249. eCollection 2023 Apr.
5. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum.
JAMA Intern Med. 2023 Jun 1;183(6):589-596. doi: 10.1001/jamainternmed.2023.1838.
6. Assessing the Accuracy of Responses by the Language Model ChatGPT to Questions Regarding Bariatric Surgery.
Obes Surg. 2023 Jun;33(6):1790-1796. doi: 10.1007/s11695-023-06603-5. Epub 2023 Apr 27.
7. Performance of an Artificial Intelligence Chatbot in Ophthalmic Knowledge Assessment.
JAMA Ophthalmol. 2023 Jun 1;141(6):589-597. doi: 10.1001/jamaophthalmol.2023.1144.
8. Aesthetic Surgery Advice and Counseling from Artificial Intelligence: A Rhinoplasty Consultation with ChatGPT.
Aesthetic Plast Surg. 2023 Oct;47(5):1985-1993. doi: 10.1007/s00266-023-03338-7. Epub 2023 Apr 24.
9. What if your patient switches from Dr. Google to Dr. ChatGPT? A vignette-based survey of the trustworthiness, value, and danger of ChatGPT-generated responses to health questions.
Eur J Cardiovasc Nurs. 2024 Jan 12;23(1):95-98. doi: 10.1093/eurjcn/zvad038.
10. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine.
N Engl J Med. 2023 Mar 30;388(13):1233-1239. doi: 10.1056/NEJMsr2214184.