Large language models approach expert-level clinical knowledge and reasoning in ophthalmology: A head-to-head cross-sectional study.

Authors

Thirunavukarasu Arun James, Mahmood Shathar, Malem Andrew, Foster William Paul, Sanghera Rohan, Hassan Refaat, Zhou Sean, Wong Shiao Wei, Wong Yee Ling, Chong Yu Jeat, Shakeel Abdullah, Chang Yin-Hsi, Tan Benjamin Kye Jyn, Jain Nikhil, Tan Ting Fang, Rauz Saaeha, Ting Daniel Shu Wei, Ting Darren Shu Jeng

Affiliations

University of Cambridge School of Clinical Medicine, Cambridge, United Kingdom.

Oxford University Clinical Academic Graduate School, University of Oxford, Oxford, United Kingdom.

Publication information

PLOS Digit Health. 2024 Apr 17;3(4):e0000341. doi: 10.1371/journal.pdig.0000341. eCollection 2024 Apr.

DOI:10.1371/journal.pdig.0000341
PMID:38630683
Full text:
https://pmc.ncbi.nlm.nih.gov/articles/PMC11023493/
Abstract

Large language models (LLMs) underlie remarkable recent advances in natural language processing, and they are beginning to be applied in clinical contexts. We aimed to evaluate the clinical potential of state-of-the-art LLMs in ophthalmology using a more robust benchmark than raw examination scores. We trialled GPT-3.5 and GPT-4 on 347 ophthalmology questions before GPT-3.5, GPT-4, PaLM 2, LLaMA, expert ophthalmologists, and doctors in training were trialled on a mock examination of 87 questions. Performance was analysed with respect to question subject and type (first order recall and higher order reasoning). Masked ophthalmologists graded the accuracy, relevance, and overall preference of GPT-3.5 and GPT-4 responses to the same questions. The performance of GPT-4 (69%) was superior to GPT-3.5 (48%), LLaMA (32%), and PaLM 2 (56%). GPT-4 compared favourably with expert ophthalmologists (median 76%, range 64-90%), ophthalmology trainees (median 59%, range 57-63%), and unspecialised junior doctors (median 43%, range 41-44%). Low agreement between LLMs and doctors reflected idiosyncratic differences in knowledge and reasoning with overall consistency across subjects and types (p>0.05). All ophthalmologists preferred GPT-4 responses over GPT-3.5 and rated the accuracy and relevance of GPT-4 as higher (p<0.05). LLMs are approaching expert-level knowledge and reasoning skills in ophthalmology. In view of the comparable or superior performance to trainee-grade ophthalmologists and unspecialised junior doctors, state-of-the-art LLMs such as GPT-4 may provide useful medical advice and assistance where access to expert ophthalmologists is limited. Clinical benchmarks provide useful assays of LLM capabilities in healthcare before clinical trials can be designed and conducted.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/029b/11023493/7436e48f98e7/pdig.0000341.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/029b/11023493/dc0aec17f452/pdig.0000341.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/029b/11023493/d28b6f7ad715/pdig.0000341.g002.jpg
