Ling Xitong, Lei Yuanyuan, Li Jiawen, Cheng Junru, Huang Wenting, Guan Tian, Guan Jian, He Yonghong
Shenzhen International Graduate School, Tsinghua University, Shenzhen, 518071, China.
National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital & Shenzhen Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Shenzhen, 518116, China.
Sci Data. 2025 Aug 7;12(1):1381. doi: 10.1038/s41597-025-05586-5.
Advances in optical microscopy scanning have significantly contributed to computational pathology (CPath) by converting traditional histopathological slides into whole slide images (WSIs). This development enables comprehensive digital review by pathologists and accelerates AI-driven diagnostic support for WSI analysis. Recent advances in pathology foundation models have increased the need for benchmarking tasks. The Camelyon series is one of the most widely used open-source datasets in computational pathology. However, the quality, accessibility, and clinical relevance of its labels have not been comprehensively evaluated. In this study, we reprocessed 1,399 WSIs and labels from the Camelyon-16 and Camelyon-17 datasets, removing low-quality slides, correcting erroneous labels, and providing expert pixel-level annotations for tumor regions in the previously unreleased test set. Based on the sizes of the re-annotated tumor regions, we upgraded the binary cancer screening task to a four-class task: negative, isolated tumor cells (ITC), micro-metastasis, and macro-metastasis. We re-evaluated pre-trained pathology feature extractors and multiple instance learning (MIL) methods on the cleaned dataset, providing a benchmark that advances AI development in histopathology.
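The size-based relabeling described above can be illustrated with a minimal sketch. The abstract does not state the size cut-offs, so the values below are an assumption taken from the conventional pN staging rule used in the Camelyon-17 challenge (ITC up to 0.2 mm, micro-metastasis above 0.2 mm and up to 2 mm, macro-metastasis above 2 mm). The function name `classify_slide` and its input format are hypothetical, and the cell-count criterion that sometimes accompanies the ITC definition is omitted.

```python
from typing import Sequence

# Assumed cut-offs (conventional pN staging), not stated in the abstract.
ITC_MAX_MM = 0.2    # largest lesion up to 0.2 mm -> isolated tumor cells
MICRO_MAX_MM = 2.0  # above 0.2 mm and up to 2 mm -> micro-metastasis


def classify_slide(lesion_diameters_mm: Sequence[float]) -> str:
    """Map the annotated tumor regions of one slide to a four-class label.

    `lesion_diameters_mm` holds the largest diameter (in mm) of each
    annotated tumor region on the slide; an empty sequence means that
    no tumor was annotated.
    """
    if len(lesion_diameters_mm) == 0:
        return "negative"
    if any(d <= 0 for d in lesion_diameters_mm):
        raise ValueError("lesion diameters must be positive")

    # The slide label is driven by the largest lesion on the slide.
    largest = max(lesion_diameters_mm)
    if largest <= ITC_MAX_MM:
        return "itc"
    if largest <= MICRO_MAX_MM:
        return "micro"
    return "macro"
```

Under these assumptions, a slide with lesions of 0.1 mm and 1.5 mm would be labeled as a micro-metastasis, because the largest lesion determines the class.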