Enhancing chest X-ray datasets with privacy-preserving large language models and multi-type annotations: A data-driven approach for improved classification.

Affiliation

Imaging Biomarkers and Computer-Aided Diagnosis Laboratory, Department of Radiology and Imaging Sciences, National Institutes of Health Clinical Center, Bldg 10, Room 1C224D, 10 Center Dr, Bethesda, MD 20892-1182, USA.

Publication information

Med Image Anal. 2025 Jan;99:103383. doi: 10.1016/j.media.2024.103383. Epub 2024 Nov 10.

DOI:10.1016/j.media.2024.103383

PMID:39546982

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11609015/

Abstract

In chest X-ray (CXR) image analysis, rule-based systems are usually employed to extract labels from reports for dataset releases. However, there is still room for improvement in label quality. These labelers typically output only presence labels, sometimes with binary uncertainty indicators, which limits their usefulness. Supervised deep learning models have also been developed for report labeling but lack adaptability, similar to rule-based systems. In this work, we present MAPLEZ (Medical report Annotations with Privacy-preserving Large language model using Expeditious Zero shot answers), a novel approach leveraging a locally executable Large Language Model (LLM) to extract and enhance findings labels on CXR reports. MAPLEZ extracts not only binary labels indicating the presence or absence of a finding but also the location, severity, and radiologists' uncertainty about the finding. Over eight abnormalities from five test sets, we show that our method can extract these annotations with an increase of 3.6 percentage points (pp) in macro F1 score for categorical presence annotations and more than 20 pp increase in F1 score for the location annotations over competing labelers. Additionally, using the combination of improved annotations and multi-type annotations in classification supervision in a dataset of limited-resolution CXRs, we demonstrate substantial advancements in proof-of-concept classification quality, with an increase of 1.1 pp in AUROC over models trained with annotations from the best alternative approach. We share code and annotations.
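
The macro F1 score reported above is, in its standard definition, the unweighted mean of per-finding F1 scores. The sketch below illustrates that definition only; the abstract does not state how the authors aggregate across the five test sets, and the finding names and labels here are hypothetical toy values, not data from the paper.

```python
from typing import Dict, List, Tuple


def f1_binary(y_true: List[int], y_pred: List[int]) -> float:
    """F1 for one finding, with 1 = present and 0 = absent."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # Convention assumed here: F1 is 0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


def macro_f1(per_finding: Dict[str, Tuple[List[int], List[int]]]) -> float:
    """Unweighted mean of per-finding F1 scores."""
    scores = [f1_binary(t, p) for t, p in per_finding.values()]
    return sum(scores) / len(scores)


# Hypothetical toy labels: (reference labels, labeler output) per finding.
toy = {
    "finding_a": ([1, 0, 1, 1], [1, 0, 0, 1]),  # tp=2, fp=0, fn=1
    "finding_b": ([0, 0, 1, 0], [0, 1, 1, 0]),  # tp=1, fp=1, fn=0
}

if __name__ == "__main__":
    print(f"macro F1 = {macro_f1(toy):.4f}")
```

Because every finding contributes equally to the mean, a rare finding with poor labels lowers the macro score as much as a common one, which is why the metric suits comparisons across labelers.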
