Visual-textual integration in LLMs for medical diagnosis: A preliminary quantitative analysis.

Author information

Agbareia Reem, Omar Mahmud, Soffer Shelly, Glicksberg Benjamin S, Nadkarni Girish N, Klang Eyal

Affiliations

Ophthalmology Department, Hadassah Medical Center, Jerusalem, Israel.

Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA.

Publication information

Comput Struct Biotechnol J. 2024 Dec 22;27:184-189. doi: 10.1016/j.csbj.2024.12.019. eCollection 2025.

Abstract

BACKGROUND AND AIM

Visual data from images is essential for many medical diagnoses. This study evaluates the performance of multimodal Large Language Models (LLMs) in integrating textual and visual information for diagnostic purposes.

METHODS

We tested GPT-4o and Claude Sonnet 3.5 on 120 clinical vignettes with and without accompanying images. Each vignette included patient demographics, a chief concern, and relevant medical history. Vignettes were paired with either clinical or radiological images from two sources: 100 images from the OPENi database and 20 images from recent NEJM challenges, ensuring they were not in the LLMs' training sets. Three primary care physicians served as a human benchmark. We analyzed diagnostic accuracy and the models' explanations for a subset of cases.
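The abstract does not describe the querying setup in detail. Purely as illustration, the sketch below shows how one vignette might be posed to GPT-4o in both the text-only and text-plus-image conditions via the OpenAI Python SDK; the prompt wording, the helper name `diagnose`, and the example inputs are assumptions, not the authors' protocol.

```python
# Hypothetical sketch, not the authors' code: query GPT-4o with one clinical
# vignette, with and without an accompanying image (OpenAI Python SDK >= 1.0).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def diagnose(vignette: str, image_url: str | None = None) -> str:
    """Return the model's free-text diagnosis for a single vignette."""
    content = [{"type": "text",
                "text": f"Give the single most likely diagnosis.\n\n{vignette}"}]
    if image_url is not None:
        # Text-plus-image condition: attach the clinical or radiological image.
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        temperature=0,  # keep outputs stable across evaluation runs
    )
    return response.choices[0].message.content

# Illustrative vignette (not from the study's dataset):
vignette = ("54-year-old man, chief concern of acute dyspnea, "
            "history of hypertension and smoking.")
text_only_dx = diagnose(vignette)
with_image_dx = diagnose(vignette, image_url="https://example.org/cxr.jpg")
```

Each response would then be scored against the reference diagnosis to yield the per-condition accuracies reported below.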

RESULTS

LLMs outperformed physicians in text-only scenarios (GPT-4o: 70.8%, Claude Sonnet 3.5: 59.5%, Physicians: 39.5%, p < 0.001, Bonferroni-adjusted). With image integration, all improved, but physicians showed the largest gain (GPT-4o: 84.5%, p < 0.001; Claude Sonnet 3.5: 67.3%, p = 0.060; Physicians: 78.8%, p < 0.001, all Bonferroni-adjusted). LLMs altered their explanatory reasoning in 45-60% of cases when images were provided.
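The abstract reports Bonferroni-adjusted p-values but does not name the underlying test. As a hedged sketch only, the snippet below runs unpaired two-proportion z-tests on counts back-calculated from the reported percentages (n = 120) and applies a Bonferroni correction with statsmodels; the counts are illustrative, the study's design is paired (the same vignettes with and without images, which would suit a test such as McNemar's), and this sketch is not expected to reproduce the published p-values.

```python
# Hypothetical sketch, not the study's analysis: two-proportion z-tests with
# a Bonferroni correction. Counts are back-calculated from the reported
# percentages (n = 120 vignettes) and are illustrative only.
from statsmodels.stats.multitest import multipletests
from statsmodels.stats.proportion import proportions_ztest

n = 120
# (correct text-only, correct with image) per examiner, rounded from the abstract:
results = {
    "GPT-4o":            (85, 101),  # 70.8% -> 84.5%
    "Claude Sonnet 3.5": (71, 81),   # 59.5% -> 67.3%
    "Physicians":        (47, 95),   # 39.5% -> 78.8%
}

raw_p = []
for name, (text_only, with_image) in results.items():
    # H0: accuracy is unchanged by adding images. This treats the two conditions
    # as independent samples; the paired design would call for McNemar's test.
    _, p = proportions_ztest([text_only, with_image], [n, n])
    raw_p.append(p)

# Adjust for the three simultaneous comparisons.
_, p_adj, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
for (name, _), p in zip(results.items(), p_adj):
    print(f"{name}: Bonferroni-adjusted p = {p:.4f}")
```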

CONCLUSION

Multimodal LLMs showed higher diagnostic accuracy than physicians in text-only scenarios, even in cases designed to require visual interpretation, suggesting that while images can enhance diagnostic accuracy, they may not be essential in every instance. Although adding images further improved LLM performance, the magnitude of this improvement was smaller than that observed in physicians. These findings suggest that enhanced visual data processing may be needed for LLMs to achieve the degree of image-related performance gains seen in human examiners.


Graphical abstract: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d47b/11754970/4c173a385c38/ga1.jpg
