
A Clinically-Informed Framework for Evaluating Vision-Language Models in Radiology Report Generation: Taxonomy of Errors and Risk-Aware Metric

Author Information

Guan Hao, Hou Peter C, Hong Pengyu, Wang Liqin, Zhang Wenyu, Du Xinsong, Zhou Zhengyang, Zhou Li

Affiliations

Brigham and Women's Hospital, Boston, MA.

Harvard Medical School, Boston, MA.

Publication Information

medRxiv. 2025 Jul 14:2025.07.13.25331222. doi: 10.1101/2025.07.13.25331222.

Abstract

Recent advances in vision-language models (VLMs) have enabled automatic radiology report generation, yet current evaluation methods remain limited to general-purpose NLP metrics or coarse classification-based clinical scores. In this study, we propose a clinically informed evaluation framework for VLM-generated radiology reports that goes beyond traditional performance measures. We define a taxonomy of 12 radiology-specific error types, each annotated with clinical risk levels (low, medium, high) in collaboration with physicians. Using this framework, we conduct a comprehensive error analysis of three representative VLMs, i.e., DeepSeek VL2, CXR-LLaVA, and CheXagent, on 685 gold-standard, expert-annotated MIMIC-CXR cases. We further introduce a risk-aware evaluation metric, the Clinical Risk-weighted Error Score for Text-generation (CREST), to quantify safety impact. Our findings reveal critical model vulnerabilities, common error patterns, and condition-specific risk profiles, offering actionable insights for model development and deployment. This work establishes a safety-centric foundation for evaluating and improving medical report generation models. The source code of our evaluation framework, including CREST computation and error taxonomy analysis, is available at https://github.com/guanharry/VLM-CREST.
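The abstract describes CREST as a risk-weighted error score but does not give its formula; the paper's repository defines the actual computation. As an illustration only, a risk-weighted error score of this general shape could be sketched as follows, where the weight values and per-report normalization are assumptions, not the published CREST definition:

```python
# Illustrative sketch of a clinical risk-weighted error score.
# The weights below and the per-report normalization are hypothetical;
# the authoritative CREST computation is in the paper's repository
# (https://github.com/guanharry/VLM-CREST).

RISK_WEIGHTS = {"low": 1.0, "medium": 2.0, "high": 3.0}  # assumed weights

def risk_weighted_error_score(error_counts, n_reports):
    """Average weighted error burden per generated report.

    error_counts: mapping from risk level ("low"/"medium"/"high")
                  to the number of errors of that level found across
                  all evaluated reports.
    n_reports:    number of reports evaluated.
    """
    if n_reports <= 0:
        raise ValueError("n_reports must be positive")
    total = sum(RISK_WEIGHTS[level] * count
                for level, count in error_counts.items())
    return total / n_reports

# Example: 10 reports containing 4 low-, 2 medium-, and 1 high-risk error
score = risk_weighted_error_score({"low": 4, "medium": 2, "high": 1}, 10)
print(score)  # (4*1 + 2*2 + 1*3) / 10 = 1.1
```

Under such a scheme, a model that makes a few high-risk errors can score worse than one making many low-risk errors, which matches the paper's safety-centric motivation.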


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ae3c/12338887/bef0818fae6a/nihpp-2025.07.13.25331222v1-f0001.jpg
