Campanella Gabriele, Chen Shengjia, Singh Manbir, Verma Ruchika, Muehlstedt Silke, Zeng Jennifer, Stock Aryeh, Croken Matt, Veremis Brandon, Elmas Abdulkadir, Shujski Ivan, Neittaanmäki Noora, Huang Kuan-Lin, Kwan Ricky, Houldsworth Jane, Schoenfeld Adam J, Vanderbilt Chad
Windreich Department of AI and Human Health, Icahn School of Medicine at Mount Sinai, New York, 10029, NY, USA.
Hasso Plattner Institute at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, 10029, NY, USA.
Nat Commun. 2025 Apr 17;16(1):3640. doi: 10.1038/s41467-025-58796-1.
The use of self-supervised learning to train pathology foundation models has increased substantially in the past few years. Notably, several models trained on large quantities of clinical data have been made publicly available in recent months. This will significantly enhance scientific research in computational pathology and help bridge the gap between research and clinical deployment. With the increasing availability of public foundation models of different sizes, trained using different algorithms on different datasets, it becomes important to establish a benchmark for comparing the performance of such models on a variety of clinically relevant tasks spanning multiple organs and diseases. In this work, we present a collection of pathology datasets comprising clinical slides associated with clinically relevant endpoints, including cancer diagnoses and a variety of biomarkers, generated during routine hospital operations at three medical centers. We leverage these datasets to systematically assess the performance of public pathology foundation models and provide insights into best practices for training foundation models and selecting appropriate pretrained models. To enable the community to evaluate their models on our clinical datasets, we make an automated benchmarking pipeline available for external use.