Institute for Signals, Systems and Computational Intelligence, sinc(i) CONICET-UNL, Santa Fe, CP 3002, Argentina.
Health Informatics Department at Hospital Italiano de Buenos Aires, Buenos Aires, CP 1199, Argentina.
Sci Data. 2024 May 17;11(1):511. doi: 10.1038/s41597-024-03358-1.
The development of successful artificial intelligence models for chest X-ray analysis relies on large, diverse datasets with high-quality annotations. While several databases of chest X-ray images have been released, most include disease diagnosis labels but lack detailed pixel-level anatomical segmentation labels. To address this gap, we introduce an extensive multi-center chest X-ray segmentation dataset with uniform, fine-grained anatomical annotations for images from five well-known publicly available databases: ChestX-ray8, CheXpert, MIMIC-CXR-JPG, Padchest, and VinDr-CXR, resulting in 657,566 segmentation masks. Our methodology uses the HybridGNet model to ensure consistent, high-quality segmentations across all datasets. The resulting masks were rigorously validated through expert physician evaluation and automatic quality control. Additionally, we provide an individual quality index for each mask and an overall quality estimate for each dataset. This dataset serves as a valuable resource for the broader scientific community, streamlining the development and assessment of innovative methodologies in chest X-ray analysis.