Artificial intelligence derived large language model in decision-making process in uveitis.

Author information

Schumacher Inès, Bühler Virginie Manuela Marie, Jaggi Damian, Roth Janice

Affiliations

Department of Ophthalmology, Inselspital, University Hospital of Bern, Bern, Switzerland.

Moorfields Eye Hospital NHS Foundation Trust, City Road, London, EC1V 2PD, UK.

Publication information

Int J Retina Vitreous. 2024 Sep 11;10(1):63. doi: 10.1186/s40942-024-00581-1.

DOI:10.1186/s40942-024-00581-1

PMID:39261870

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11389245/

Abstract

BACKGROUND

Uveitis is the ophthalmic subfield dealing with a broad range of intraocular inflammatory diseases. Given the rising importance of large language models (LLMs) such as ChatGPT and their potential use in medicine, this study explores the strengths and weaknesses of their applicability in uveitis.

METHODS

A series of highly clinically relevant questions about current uveitis cases was posed to the LLM three consecutive times (attempts 1, 2 and 3). Each answer was classified as accurate and sufficient, partially accurate and sufficient, or inaccurate and insufficient. Statistical analysis included descriptive analysis, normality testing, a non-parametric test and reliability tests. The references generated by the LLM were checked for correctness in several medical databases.
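As a rough illustration of the reliability analysis, the sketch below computes Cohen's kappa for two attempts rated on the three-category scale. The ratings and the short labels (`A`, `P`, `I`) are hypothetical and chosen only for illustration; they are not data from the study, and the abstract does not state which software the authors used.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equally long lists of categorical ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both rating lists must be non-empty and equally long")
    n = len(ratings_a)

    # Observed agreement: share of questions rated identically in both attempts.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: product of the marginal category frequencies, summed.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both attempts used a single identical category; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings (NOT study data):
# A = accurate and sufficient, P = partially accurate and sufficient,
# I = inaccurate and insufficient.
attempt_1 = ["A", "A", "P", "I", "A", "P"]
attempt_2 = ["A", "P", "P", "I", "A", "A"]

kappa = cohens_kappa(attempt_1, attempt_2)
print(f"Cohen's kappa: {kappa:.4f}")  # 4/6 observed, 14/36 by chance -> 0.4545
```

Kappa corrects raw agreement for the agreement expected by chance, which is why two attempts that match on four of six questions still score only about 0.45 here.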

RESULTS

The data were not normally distributed. Ratings did not differ significantly between attempts 1, 2 and 3 (Kruskal-Wallis H test, p = 0.7338). Agreement was moderate between attempts 1 and 2 (Cohen's kappa, κ = 0.5172) and between attempts 2 and 3 (Cohen's kappa, κ = 0.4913), and fair between attempts 1 and 3 (Cohen's kappa, κ = 0.3647). The average pairwise agreement was moderate (Cohen's kappa, κ = 0.4577), as was the agreement across all three attempts together (Fleiss' kappa, κ = 0.4534). The LLM generated a total of 52 references. Twenty-two (42.3%) were accurate and correctly cited, another 22 (42.3%) could not be located in any of the searched databases, and the remaining 8 (15.4%) exist but were either misinterpreted or incorrectly cited by the LLM.
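The summary figures in these results can be re-derived with simple arithmetic. The sketch below recomputes the mean pairwise kappa and the reference percentages from the reported values. The verbal labels are assumed to follow the commonly used Landis and Koch bands, which the abstract does not name explicitly; the variable and function names are illustrative.

```python
# Pairwise Cohen's kappa values as reported in the abstract.
pairwise_kappa = {"1 vs 2": 0.5172, "2 vs 3": 0.4913, "1 vs 3": 0.3647}


def agreement_label(kappa):
    """Verbal label for kappa, assuming the Landis and Koch bands."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"),
             (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


mean_kappa = sum(pairwise_kappa.values()) / len(pairwise_kappa)

# Reference counts as reported in the abstract.
reference_counts = {"accurate": 22, "not found": 22, "miscited": 8}
total = sum(reference_counts.values())
shares = {k: round(100 * v / total, 1) for k, v in reference_counts.items()}
unreliable = round(
    100 * (reference_counts["not found"] + reference_counts["miscited"]) / total, 1
)

for pair, kappa in pairwise_kappa.items():
    print(f"attempt {pair}: kappa = {kappa} ({agreement_label(kappa)})")
print(f"mean pairwise kappa = {mean_kappa:.4f} ({agreement_label(mean_kappa)})")
print(f"reference shares (%) = {shares}, not usable as cited = {unreliable}%")
```

Under these assumptions the recomputed values match the abstract: the mean pairwise kappa is 0.4577, and 30 of the 52 references (57.7%) were either not found or miscited.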

CONCLUSION

Our results demonstrate the significant potential of LLMs in uveitis. However, their implementation requires rigorous training and comprehensive testing for specific medical tasks. We also found that most references generated by ChatGPT 4.o were incorrect: 30 of 52 (57.7%) either could not be located or were miscited. LLMs are likely to become invaluable tools in shaping the future of ophthalmology, enhancing clinical decision-making and patient care.

Similar articles

Quality of Answers of Generative Large Language Models Versus Peer Users for Interpreting Laboratory Test Results for Lay Patients: Evaluation Study.

J Med Internet Res. 2024 Apr 17;26:e56655. doi: 10.2196/56655.

Evaluation of the Performance of Generative AI Large Language Models ChatGPT, Google Bard, and Microsoft Bing Chat in Supporting Evidence-Based Dentistry: Comparative Mixed Methods Study.

J Med Internet Res. 2023 Dec 28;25:e51580. doi: 10.2196/51580.

Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study.

ArXiv. 2024 Jan 23:arXiv:2402.01693v1.

Triage Performance Across Large Language Models, ChatGPT, and Untrained Doctors in Emergency Medicine: Comparative Study.

J Med Internet Res. 2024 Jun 14;26:e53297. doi: 10.2196/53297.

Evidence-based potential of generative artificial intelligence large language models in orthodontics: a comparative study of ChatGPT, Google Bard, and Microsoft Bing.

Eur J Orthod. 2024 Apr 13. doi: 10.1093/ejo/cjae017.

Potential of Large Language Models in Health Care: Delphi Study.

J Med Internet Res. 2024 May 13;26:e52399. doi: 10.2196/52399.

ChatGPT-3.5 Versus Google Bard: Which Large Language Model Responds Best to Commonly Asked Pregnancy Questions?

Cureus. 2024 Jul 27;16(7):e65543. doi: 10.7759/cureus.65543. eCollection 2024 Jul.

Chat Generative Pretrained Transformer (ChatGPT) and Bard: Artificial Intelligence Does not yet Provide Clinically Supported Answers for Hip and Knee Osteoarthritis.

J Arthroplasty. 2024 May;39(5):1184-1190. doi: 10.1016/j.arth.2024.01.029. Epub 2024 Jan 17.

Quality of Large Language Model Responses to Radiation Oncology Patient Care Questions.

JAMA Netw Open. 2024 Apr 1;7(4):e244630. doi: 10.1001/jamanetworkopen.2024.4630.

Cited by

Performance analysis of an emergency triage system in ophthalmology using a customized CHATBOT.

Digit Health. 2025 May 11;11:20552076251320298. doi: 10.1177/20552076251320298. eCollection 2025 Jan-Dec.

Opportunities and Challenges of Chatbots in Ophthalmology: A Narrative Review.

J Pers Med. 2024 Dec 21;14(12):1165. doi: 10.3390/jpm14121165.
