
A Clinically-Informed Framework for Evaluating Vision-Language Models in Radiology Report Generation: Taxonomy of Errors and Risk-Aware Metric

Author Information

Guan Hao, Hou Peter C, Hong Pengyu, Wang Liqin, Zhang Wenyu, Du Xinsong, Zhou Zhengyang, Zhou Li

Affiliations

Brigham and Women's Hospital, Boston, MA.

Harvard Medical School, Boston, MA.

Publication Information

medRxiv. 2025 Jul 14:2025.07.13.25331222. doi: 10.1101/2025.07.13.25331222.

Abstract

Recent advances in vision-language models (VLMs) have enabled automatic radiology report generation, yet current evaluation methods remain limited to general-purpose NLP metrics or coarse classification-based clinical scores. In this study, we propose a clinically informed evaluation framework for VLM-generated radiology reports that goes beyond traditional performance measures. We define a taxonomy of 12 radiology-specific error types, each annotated with clinical risk levels (low, medium, high) in collaboration with physicians. Using this framework, we conduct a comprehensive error analysis of three representative VLMs, i.e., DeepSeek VL2, CXR-LLaVA, and CheXagent, on 685 gold-standard, expert-annotated MIMIC-CXR cases. We further introduce a risk-aware evaluation metric, the Clinical Risk-weighted Error Score for Text-generation (CREST), to quantify safety impact. Our findings reveal critical model vulnerabilities, common error patterns, and condition-specific risk profiles, offering actionable insights for model development and deployment. This work establishes a safety-centric foundation for evaluating and improving medical report generation models. The source code of our evaluation framework, including CREST computation and error taxonomy analysis, is available at https://github.com/guanharry/VLM-CREST.
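The abstract describes CREST as a risk-weighted error score but does not give its formula; the paper's repository defines the actual computation. As an illustration only, a risk-weighted error score of this general shape could be sketched as follows, where the weight values and per-report normalization are assumptions, not the published CREST definition:

```python
# Illustrative sketch of a clinical risk-weighted error score.
# The weights below and the per-report normalization are hypothetical;
# the authoritative CREST computation is in the paper's repository
# (https://github.com/guanharry/VLM-CREST).

RISK_WEIGHTS = {"low": 1.0, "medium": 2.0, "high": 3.0}  # assumed weights

def risk_weighted_error_score(error_counts, n_reports):
    """Average weighted error burden per generated report.

    error_counts: mapping from risk level ("low"/"medium"/"high")
                  to the number of errors of that level found across
                  all evaluated reports.
    n_reports:    number of reports evaluated.
    """
    if n_reports <= 0:
        raise ValueError("n_reports must be positive")
    total = sum(RISK_WEIGHTS[level] * count
                for level, count in error_counts.items())
    return total / n_reports

# Example: 10 reports containing 4 low-, 2 medium-, and 1 high-risk error
score = risk_weighted_error_score({"low": 4, "medium": 2, "high": 1}, 10)
print(score)  # (4*1 + 2*2 + 1*3) / 10 = 1.1
```

Under such a scheme, a model that makes a few high-risk errors can score worse than one making many low-risk errors, which matches the paper's safety-centric motivation.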


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ae3c/12338887/bef0818fae6a/nihpp-2025.07.13.25331222v1-f0001.jpg
