Slit Lamp Report Generation and Question Answering: Development and Validation of a Multimodal Transformer Model with Large Language Model Integration.

Authors

Zhao Ziwei, Zhang Weiyi, Chen Xiaolan, Song Fan, Gunasegaram James, Huang Wenyong, Shi Danli, He Mingguang, Liu Na

Affiliations

School of Optometry, The Hong Kong Polytechnic University, Hong Kong, China.

Monash University, Victoria, Australia.

Publication

J Med Internet Res. 2024 Dec 30;26:e54047. doi: 10.2196/54047.

DOI:10.2196/54047
PMID:39753218
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11729784/
Abstract

BACKGROUND

Large language models have shown remarkable efficacy in various medical research and clinical applications. However, their skills in medical image recognition and subsequent report generation or question answering (QA) remain limited.

OBJECTIVE

We aim to fine-tune a multimodal, transformer-based model for generating medical reports from slit lamp images and to develop a QA system using Llama2. We term this entire process slit lamp-GPT.

METHODS

Our research used a dataset of 25,051 slit lamp images from 3409 participants, paired with their corresponding physician-created medical reports. We used these data, split into training, validation, and test sets, to fine-tune the Bootstrapping Language-Image Pre-training (BLIP) framework toward report generation. The generated text reports and human-posed questions were then input into Llama2 for subsequent QA. We evaluated performance using quantitative metrics (including BLEU [bilingual evaluation understudy], CIDEr [consensus-based image description evaluation], ROUGE-L [Recall-Oriented Understudy for Gisting Evaluation-Longest Common Subsequence], SPICE [Semantic Propositional Image Caption Evaluation], accuracy, sensitivity, specificity, precision, and F-score) and the subjective assessments of two experienced ophthalmologists on a 1-3 scale (1 referring to high quality).
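The text-generation metrics above are n-gram overlap scores. As a minimal illustration of sentence-level BLEU (a sketch assuming whitespace tokenization and a single reference report; this is not the authors' evaluation code):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with uniform weights and a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        overlap = sum((cand & ref).values())   # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: 1 when the candidate is at least as long as the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c >= r else math.exp(1 - r / max(c, 1))
    return bp * math.exp(log_avg)

ref = "the slit lamp image shows nuclear cataract".split()
print(bleu(ref, ref))  # identical texts score 1.0
```

BLEU-1 through BLEU-4 in the Results correspond to `max_n` = 1..4; the paper's scores are corpus-level, which additionally pools n-gram counts over all reports.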

RESULTS

We identified 50 conditions related to diseases or postoperative complications through keyword matching in initial reports. The refined slit lamp-GPT model demonstrated BLEU scores (1-4) of 0.67, 0.66, 0.65, and 0.65, respectively, with a CIDEr score of 3.24, a ROUGE-L score of 0.61, and a SPICE score of 0.37. The most frequently identified conditions were cataracts (22.95%), age-related cataracts (22.03%), and conjunctival concretion (13.13%). Disease classification metrics demonstrated an overall accuracy of 0.82 and an F-score of 0.64, with high accuracies (≥0.9) observed for intraocular lens, conjunctivitis, and chronic conjunctivitis, and high F-scores (≥0.9) observed for cataract and age-related cataract. For both report generation and QA components, the two evaluating ophthalmologists reached substantial agreement, with κ scores between 0.71 and 0.84. In assessing 100 generated reports, they awarded scores of 1.36 for both completeness and correctness; 64% (64/100) were considered "entirely good," and 93% (93/100) were "acceptable." In the evaluation of 300 generated answers to questions, the scores were 1.33 for completeness, 1.14 for correctness, and 1.15 for possible harm, with 66.3% (199/300) rated as "entirely good" and 91.3% (274/300) as "acceptable."
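The inter-rater agreement above is reported as Cohen's κ. A minimal sketch of the unweighted statistic for two raters (the example ratings on the paper's 1-3 scale are hypothetical):

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    assert len(r1) == len(r2) and len(r1) > 0
    n = len(r1)
    # Observed agreement: fraction of items where the raters match.
    po = sum(a == b for a, b in zip(r1, r2)) / n
    # Expected agreement under independent rating with each rater's marginals.
    c1, c2 = Counter(r1), Counter(r2)
    pe = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)

rater_a = [1, 1, 2, 1, 3, 2]
rater_b = [1, 1, 2, 2, 3, 2]
print(round(cohens_kappa(rater_a, rater_b), 2))
```

κ corrects raw percentage agreement for agreement expected by chance; values of 0.71-0.84, as reported here, are conventionally read as substantial agreement.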

CONCLUSIONS

This study introduces the slit lamp-GPT model for report generation and subsequent QA, highlighting the potential of large language models to assist ophthalmologists and patients.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9933/11729784/6fc1efab0384/jmir_v26i1e54047_fig1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9933/11729784/acd0a85f743d/jmir_v26i1e54047_fig2.jpg
