EyeGPT for Patient Inquiries and Medical Education: Development and Validation of an Ophthalmology Large Language Model.

Author information

Chen Xiaolan, Zhao Ziwei, Zhang Weiyi, Xu Pusheng, Wu Yue, Xu Mingpu, Gao Le, Li Yinwen, Shang Xianwen, Shi Danli, He Mingguang

Affiliations

School of Optometry, The Hong Kong Polytechnic University, Hong Kong, China.

State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou, China.

Publication information

J Med Internet Res. 2024 Dec 11;26:e60063. doi: 10.2196/60063.

DOI:10.2196/60063

PMID:39661433

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11669878/

Abstract

BACKGROUND

Large language models (LLMs) have the potential to enhance clinical workflow and improve medical education, but they face challenges related to specialized knowledge in ophthalmology.

OBJECTIVE

This study aims to enhance the ophthalmic knowledge of LLMs by refining a general LLM into an ophthalmology-specialized assistant for patient inquiries and medical education.

METHODS

We transformed Llama2 into an ophthalmology-specialized LLM, termed EyeGPT, through the following 3 strategies: prompt engineering for role-playing, fine-tuning with publicly available data sets filtered for eye-specific terminology (83,919 samples), and retrieval-augmented generation leveraging a medical database and 14 ophthalmology textbooks. Four board-certified ophthalmologists evaluated the EyeGPT variants using 120 questions from diverse categories in both simple and complex question-answering scenarios. The performance of the best EyeGPT model was then compared with that of an unassisted human physician group and an EyeGPT+human group. We proposed 4 assessment metrics: accuracy, understandability, trustworthiness, and empathy. The proportion of hallucinations was also reported.
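The three strategies named above can be illustrated with a toy sketch. It is not the authors' pipeline: the keyword list `EYE_TERMS`, the sample format, the word-overlap retriever, and the prompt wording are all assumptions made for illustration, and no model is fine-tuned or called here.

```python
import re

# Hypothetical keyword list; the study's actual eye-specific term list is not given in the abstract.
EYE_TERMS = {"eye", "retina", "glaucoma", "cataract", "cornea", "myopia"}

def tokenize(text):
    """Lowercase a string and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def is_eye_related(sample):
    """Keep a QA sample if its question or answer mentions any eye term."""
    words = tokenize(sample["question"] + " " + sample["answer"])
    return bool(words & EYE_TERMS)

def retrieve(query, passages, k=1):
    """Rank passages by word overlap with the query (a stand-in for a real retriever)."""
    q = tokenize(query)
    ranked = sorted(passages, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]

def build_prompt(query, passages):
    """Combine a role-play instruction, retrieved context, and the question."""
    context = "\n".join(retrieve(query, passages))
    return (
        "You are an ophthalmologist answering a patient.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )

# Step 1: filter a general medical QA set down to eye-related samples.
samples = [
    {"question": "What causes glaucoma?",
     "answer": "Raised eye pressure can damage the optic nerve."},
    {"question": "How is asthma treated?",
     "answer": "Inhaled corticosteroids are common."},
]
eye_samples = [s for s in samples if is_eye_related(s)]

# Steps 2 and 3: retrieve supporting text and wrap it in a role-play prompt.
passages = [
    "Cataract clouds the lens and blurs vision.",
    "Glaucoma damages the optic nerve, often with raised pressure.",
]
prompt = build_prompt("What damages the optic nerve in glaucoma?", passages)
print(prompt)
```

In this sketch the filtered samples would feed a fine-tuning job and the prompt would be sent to the fine-tuned model; a production system would more likely use embedding-based retrieval than word overlap.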

RESULTS

The best fine-tuned model significantly outperformed the original Llama2 model at providing informed advice (Llama2: mean 9.30, SD 4.42 vs fine-tuned: mean 13.79, SD 5.70; P<.001) and mitigating hallucinations (Llama2: 97/120, 80.8% vs fine-tuned: 53/120, 44.2%; P<.001). Incorporating information retrieval from reliable sources, particularly ophthalmology textbooks, further improved the model's responses compared with the best fine-tuned model alone (fine-tuned: mean 13.08, SD 5.43 vs with retrieval: mean 15.14, SD 4.64; P=.001) and reduced hallucinations (fine-tuned: 71/120, 59.2% vs with retrieval: 57/120, 47.4%; P=.02). Subgroup analysis revealed that EyeGPT was robust across common diseases, with consistent performance across different users and domains. Among the variants, the model integrating fine-tuning and book retrieval ranked highest, closely followed by the combination of fine-tuning and the manual database, standalone fine-tuning, and pure role-playing methods. EyeGPT demonstrated competitive capabilities in understandability and empathy when compared with human ophthalmologists. With the assistance of EyeGPT, ophthalmologist performance was notably enhanced.
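The hallucination figures above can be checked with simple arithmetic. The sketch below only recomputes percentages and percentage-point differences from the counts quoted in the abstract; the dictionary labels are shorthand chosen here, and it does not reproduce the study's significance tests.

```python
def pct(count, total):
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)

TOTAL = 120  # questions per model, as stated in the abstract

# (responses with hallucinations before, after), counts quoted in the abstract
comparisons = {
    "Llama2 vs best fine-tuned": (97, 53),
    "fine-tuned vs fine-tuned + retrieval": (71, 57),
}

summary = {}
for name, (before, after) in comparisons.items():
    summary[name] = {
        "before_pct": pct(before, TOTAL),
        "after_pct": pct(after, TOTAL),
        # Absolute reduction in percentage points, computed from raw counts
        "drop_points": pct(before - after, TOTAL),
    }
    print(name, summary[name])
```

Three of the four recomputed percentages match the abstract. The exception is 57/120, which works out to 47.5% rather than the 47.4% printed above.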

CONCLUSIONS

We introduced EyeGPT by refining a general-domain LLM and conducted a comprehensive comparison and evaluation of different strategies for developing an ophthalmology-specific assistant. Our results highlight EyeGPT's potential to assist ophthalmologists and patients in medical settings.

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5113/11669878/3a442d2128b1/jmir_v26i1e60063_fig1.jpg
