Zhao Ziwei, Zhang Weiyi, Chen Xiaolan, Song Fan, Gunasegaram James, Huang Wenyong, Shi Danli, He Mingguang, Liu Na
School of Optometry, The Hong Kong Polytechnic University, Hong Kong, China.
Monash University, Victoria, Australia.
J Med Internet Res. 2024 Dec 30;26:e54047. doi: 10.2196/54047.
Large language models have shown remarkable efficacy in various medical research and clinical applications. However, their capabilities in medical image recognition and subsequent report generation or question answering (QA) remain limited.
We aim to fine-tune a multimodal, transformer-based model for generating medical reports from slit lamp images and to develop a QA system using Llama2. We refer to this entire pipeline as slit lamp-GPT.
Our research used a dataset of 25,051 slit lamp images from 3409 participants, paired with their corresponding physician-created medical reports. We split these data into training, validation, and test sets and used them to fine-tune the Bootstrapping Language-Image Pre-training (BLIP) framework for report generation. The generated text reports and human-posed questions were then input into Llama2 for subsequent QA. We evaluated performance using quantitative metrics (including BLEU [bilingual evaluation understudy], CIDEr [consensus-based image description evaluation], ROUGE-L [Recall-Oriented Understudy for Gisting Evaluation-Longest Common Subsequence], SPICE [Semantic Propositional Image Caption Evaluation], accuracy, sensitivity, specificity, precision, and F-score) and the subjective assessments of two experienced ophthalmologists on a 1-3 scale (1 referring to high quality).
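The disease classification metrics named above can be illustrated with a minimal sketch. It assumes each condition is scored as a separate binary problem (condition present or absent in the reference report versus the generated report) and that the F-score is the balanced F1; the abstract does not state either detail, and the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, specificity, precision, and F1
    for one condition from paired 0/1 labels.

    y_true: 1 if the condition appears in the reference report, else 0.
    y_pred: 1 if the condition appears in the generated report, else 0.
    (This per-condition binary framing is an assumption, not the
    authors' documented procedure.)
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def safe_div(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    sensitivity = safe_div(tp, tp + fn)  # also called recall
    return {
        "accuracy": safe_div(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": safe_div(tn, tn + fp),
        "precision": precision,
        "f1": safe_div(2 * precision * sensitivity, precision + sensitivity),
    }


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the study.
    reference = [1, 1, 1, 0, 0, 0, 0, 0]
    generated = [1, 1, 0, 1, 0, 0, 0, 0]
    print(binary_metrics(reference, generated))
```

Accuracy can be high for a rare condition even when the F-score is low, because true negatives dominate the count; this is one plausible reason the two metrics are reported side by side.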
We identified 50 conditions related to diseases or postoperative complications through keyword matching in initial reports. The refined slit lamp-GPT model achieved BLEU-1 to BLEU-4 scores of 0.67, 0.66, 0.65, and 0.65, respectively, with a CIDEr score of 3.24, a ROUGE-L score of 0.61, and a SPICE score of 0.37. The most frequently identified conditions were cataracts (22.95%), age-related cataracts (22.03%), and conjunctival concretion (13.13%). Disease classification metrics demonstrated an overall accuracy of 0.82 and an F-score of 0.64, with high accuracies (≥0.9) observed for intraocular lens, conjunctivitis, and chronic conjunctivitis, and high F-scores (≥0.9) observed for cataract and age-related cataract. For both the report generation and QA components, the two evaluating ophthalmologists reached substantial agreement, with κ scores between 0.71 and 0.84. In assessing 100 generated reports, they awarded scores of 1.36 for both completeness and correctness; 64% (64/100) were considered "entirely good," and 93% (93/100) were "acceptable." In the evaluation of 300 generated answers to questions, the scores were 1.33 for completeness, 1.14 for correctness, and 1.15 for possible harm, with 66.3% (199/300) rated as "entirely good" and 91.3% (274/300) as "acceptable."
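The κ agreement reported above can be illustrated with a minimal sketch of Cohen's kappa for two raters. It assumes the unweighted form applied to the 1-3 rating scale; the abstract does not say whether a weighted or unweighted κ was used, and the function name `cohen_kappa` and the toy ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    proportion of agreement and p_e is the agreement expected by
    chance from each rater's marginal score frequencies.
    """
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must score the same non-empty item set")

    # Observed agreement: fraction of items given identical scores.
    p_o = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n

    # Chance agreement: product of marginal frequencies, summed over scores.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    categories = set(count_a) | set(count_b)
    p_e = sum(count_a[c] * count_b[c] for c in categories) / (n * n)

    if p_e == 1.0:
        # Both raters used a single identical score for every item.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Toy ratings on the 1-3 scale (1 = high quality); not study data.
    ophthalmologist_1 = [1, 1, 2, 2, 3, 1, 1, 2, 3, 1]
    ophthalmologist_2 = [1, 1, 2, 3, 3, 1, 2, 2, 3, 1]
    print(round(cohen_kappa(ophthalmologist_1, ophthalmologist_2), 3))
```

Because the rating scale is ordinal, a weighted κ that penalizes a 1-versus-3 disagreement more than a 1-versus-2 disagreement would also be a reasonable choice; the unweighted form is used here only for simplicity.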
This study introduces the slit lamp-GPT model for report generation and subsequent QA, highlighting the potential of large language models to assist ophthalmologists and patients.