Department of Computer Science, University of North Carolina at Charlotte, Charlotte, NC, 28262, USA.
Sensebrain Research, San Jose, CA, 95131, USA.
Sci Data. 2023 Apr 21;10(1):231. doi: 10.1038/s41597-023-02125-y.
The success of training computer-vision models relies heavily on large-scale, real-world images with annotations. Yet such annotation-ready datasets are difficult to curate in pathology because of privacy protections and the heavy annotation burden. For computational pathology, synthetic data generation, curation, and annotation offer a cost-effective way to quickly provide the data diversity required to boost model performance at different stages. In this study, we introduce a large-scale synthetic pathological image dataset paired with annotations for nuclei semantic segmentation, termed Synthetic Nuclei and annOtation Wizard (SNOW). SNOW is developed via a standardized workflow that applies an off-the-shelf image generator and nuclei annotator. The dataset contains 20k image tiles and 1,448,522 annotated nuclei in total and is released under the CC-BY license. We show that SNOW can be used in both supervised and semi-supervised training scenarios. Extensive results suggest that models trained on synthetic data are competitive under a variety of training settings, broadening the scope for using synthetic images to enhance downstream data-driven clinical tasks.