

Compact Vision-Language Models Enable Efficient and Interpretable Automated OCT Analysis Through Layer Specific Multimodal Learning.

Authors

Haghighi Tania, Gholami Sina, Sokol Jared Todd, Lim Jennifer I, Leng Theodore, Thompson Atalie C, Tabkhi Hamed, Alam Minhaj Nur

Affiliations

Department of Electrical and Computer Engineering, University of North Carolina at Charlotte, Charlotte, NC 28223, USA.

Byers Eye Institute at Stanford, Stanford University School of Medicine, Stanford, CA 94305, USA.

Publication

bioRxiv. 2025 Aug 11:2025.08.07.669187. doi: 10.1101/2025.08.07.669187.

DOI: 10.1101/2025.08.07.669187
PMID: 40832232
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12363835/
Abstract

Translating the intricate anatomical signatures of retinal disease from OCT B-scans into clear, accurate clinical narratives demands AI models that seamlessly fuse visual features with domain expertise. We curated a multimodal dataset of 40,000 OCT B-scans from public repositories and private clinical cohorts, each paired with an expert-validated summary spanning six conditions: diabetic macular edema, diabetic retinopathy, geographic atrophy, drusen, choroidal neovascularization, and healthy retina. We introduce LO-VLM, a compact (247M-parameter) vision-language model (VLM) that infuses anatomical guidance into both encoder and decoder for free-form summary generation and multiclass disease classification. Benchmarking against state-of-the-art RetinaVLM, LLaVA-Med, and a vision-only ViT model demonstrates superior performance. In a blinded evaluation in which three board-certified retina specialists scored the generated summaries, LO-VLM narratives achieved a mean of 8.5 (standard deviation = 1.15) out of 10, compared with a mean of 5.5 (standard deviation = 1.13) for RetinaVLM (p < 0.0001). In quantitative evaluations, LO-VLM achieved an SBERT similarity of 0.803 and a BERTScore F1 of 0.715, representing improvements of 8.2% and 28.8% over specialized VLM baselines. For disease classification, LO-VLM reached 96% accuracy (F1 = 96%), outperforming ViT by 13% and exceeding medical VLM benchmarks by over 62%. By reconciling interpretability with computational efficiency, LO-VLM establishes a new paradigm for efficient AI models in OCT interpretation.
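The SBERT similarity reported in the abstract is conventionally the cosine similarity between sentence embeddings of a generated summary and its reference. A minimal sketch of that comparison, assuming embeddings have already been produced by a sentence encoder (the short vectors below are illustrative placeholders, not real SBERT outputs, which typically have 384 or more dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the vectors divided by
    # the product of their Euclidean norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Placeholder embeddings standing in for encoder outputs of a
# generated summary and a reference summary.
generated = [0.12, 0.85, 0.31, 0.44]
reference = [0.10, 0.80, 0.35, 0.40]

score = cosine_similarity(generated, reference)
print(round(score, 3))  # → 0.998
```

A corpus-level SBERT score such as the 0.803 reported here would be the mean of this per-pair similarity over the evaluation set.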


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/610d/12363835/e11dbcf41f22/nihpp-2025.08.07.669187v1-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/610d/12363835/905079235917/nihpp-2025.08.07.669187v1-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/610d/12363835/4076046057c5/nihpp-2025.08.07.669187v1-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/610d/12363835/c2e4165ee1ea/nihpp-2025.08.07.669187v1-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/610d/12363835/46684e7960fc/nihpp-2025.08.07.669187v1-f0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/610d/12363835/02e6aa4558cf/nihpp-2025.08.07.669187v1-f0006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/610d/12363835/17aa48f20f3f/nihpp-2025.08.07.669187v1-f0007.jpg

Similar Articles

1
Compact Vision-Language Models Enable Efficient and Interpretable Automated OCT Analysis Through Layer Specific Multimodal Learning.
bioRxiv. 2025 Aug 11:2025.08.07.669187. doi: 10.1101/2025.08.07.669187.
2
Menstrual Health Education Using a Specialized Large Language Model in India: Development and Evaluation Study of MenstLLaMA.
J Med Internet Res. 2025 Jul 16;27:e71977. doi: 10.2196/71977.
3
Artificial intelligence for diagnosing exudative age-related macular degeneration.
Cochrane Database Syst Rev. 2024 Oct 17;10(10):CD015522. doi: 10.1002/14651858.CD015522.pub2.
4
Optical coherence tomography (OCT) for detection of macular oedema in patients with diabetic retinopathy.
Cochrane Database Syst Rev. 2015 Jan 7;1(1):CD008081. doi: 10.1002/14651858.CD008081.pub3.
5
Optical coherence tomography (OCT) for detection of macular oedema in patients with diabetic retinopathy.
Cochrane Database Syst Rev. 2011 Jul 6;(7):CD008081. doi: 10.1002/14651858.CD008081.pub2.
6
Leveraging a foundation model zoo for cell similarity search in oncological microscopy across devices.
Front Oncol. 2025 Jun 18;15:1480384. doi: 10.3389/fonc.2025.1480384. eCollection 2025.
7
CXR-MultiTaskNet: a unified deep learning framework for joint disease localization and classification in chest radiographs.
Sci Rep. 2025 Aug 31;15(1):32022. doi: 10.1038/s41598-025-16669-z.
8
Enhancing Clinical Relevance of Pretrained Language Models Through Integration of External Knowledge: Case Study on Cardiovascular Diagnosis From Electronic Health Records.
JMIR AI. 2024 Aug 6;3:e56932. doi: 10.2196/56932.
9
Radiology report generation using automatic keyword adaptation, frequency-based multi-label classification and text-to-text large language models.
Comput Biol Med. 2025 Jul 3;196(Pt A):110625. doi: 10.1016/j.compbiomed.2025.110625.
10
Prescription of Controlled Substances: Benefits and Risks

References Cited in This Article

1
Specialized curricula for training vision language models in retinal image analysis.
NPJ Digit Med. 2025 Aug 19;8(1):532. doi: 10.1038/s41746-025-01893-8.
2
VisionUnite: a Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge.
IEEE Trans Pattern Anal Mach Intell. 2025 Aug 13;PP. doi: 10.1109/TPAMI.2025.3598734.
3
EYE-Llama, an in-domain large language model for ophthalmology.
iScience. 2025 Jun 23;28(7):112984. doi: 10.1016/j.isci.2025.112984. eCollection 2025 Jul 18.
4
Distributed training of foundation models for ophthalmic diagnosis.
Commun Eng. 2025 Jan 22;4(1):6. doi: 10.1038/s44172-025-00341-5.
5
Collaboration between clinicians and vision-language models in radiology report generation.
Nat Med. 2025 Feb;31(2):599-608. doi: 10.1038/s41591-024-03302-1. Epub 2024 Nov 7.
6
A Foundation Language-Image Model of the Retina (FLAIR): encoding expert knowledge in text supervision.
Med Image Anal. 2025 Jan;99:103357. doi: 10.1016/j.media.2024.103357. Epub 2024 Oct 1.
7
Automated classification of choroidal neovascularization, diabetic macular edema, and drusen from retinal OCT images using vision transformers: a comparative study.
Lasers Med Sci. 2024 May 27;39(1):140. doi: 10.1007/s10103-024-04089-w.
8
OCTDL: Optical Coherence Tomography Dataset for Image-Based Deep Learning Methods.
Sci Data. 2024 Apr 11;11(1):365. doi: 10.1038/s41597-024-03182-7.
9
A foundation model for generalizable disease detection from retinal images.
Nature. 2023 Oct;622(7981):156-163. doi: 10.1038/s41586-023-06555-x. Epub 2023 Sep 13.
10
DeepRetina: Layer Segmentation of Retina in OCT Images Using Deep Learning.
Transl Vis Sci Technol. 2020 Dec 9;9(2):61. doi: 10.1167/tvst.9.2.61. eCollection 2020 Dec.