Jiang Shuai, Hondelink Liesbeth, Suriawinata Arief A, Hassanpour Saeed
Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA.
Department of Pathology and Laboratory Medicine, Dartmouth-Hitchcock Medical Center, Lebanon, NH 03756, USA.
J Pathol Inform. 2024 May 31;15:100386. doi: 10.1016/j.jpi.2024.100386. eCollection 2024 Dec.
In digital pathology, whole-slide images (WSIs) are widely used for applications such as cancer diagnosis and prognosis prediction. Vision transformer (ViT) models have recently emerged as a promising method for encoding large regions of WSIs while preserving spatial relationships among patches. However, due to the large number of model parameters and limited labeled data, applying transformer models to WSIs remains challenging. In this study, we propose a pretext task to train the transformer model in a self-supervised manner. Our model, MaskHIT, uses the transformer output to reconstruct masked patches, with reconstruction quality measured by a contrastive loss. We pre-trained the MaskHIT model using over 7000 WSIs from TCGA and extensively evaluated its performance in multiple experiments, covering survival prediction, cancer subtype classification, and grade prediction tasks. Our experiments demonstrate that the pre-training procedure enables context-aware understanding of WSIs, facilitates the learning of representative histological features based on patch positions and visual patterns, and is essential for the ViT model to achieve optimal results on WSI-level tasks. The pre-trained MaskHIT surpasses various multiple instance learning approaches by 3% and 2% on the survival prediction and cancer subtype classification tasks, respectively, and also outperforms recent state-of-the-art transformer-based methods. Finally, a comparison of the attention maps generated by the MaskHIT model with pathologists' annotations indicates that the model can accurately identify clinically relevant histological structures on the whole slide for each task.
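To make the pretext task concrete, the sketch below illustrates one way a masked-patch reconstruction objective with a contrastive loss could be set up for precomputed WSI patch embeddings. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name, hyperparameters, and the InfoNCE-style formulation are hypothetical, and patch embeddings and positional encodings are assumed to be provided by an upstream encoder.

```python
# Minimal sketch of a masked-patch pretext task with a contrastive objective,
# in the spirit of the pre-training described in the abstract. Assumptions
# (not from the paper): patch embeddings are precomputed; all names and
# hyperparameters below are illustrative only.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedPatchPretext(nn.Module):
    def __init__(self, dim=384, depth=4, heads=6, mask_ratio=0.5, temperature=0.1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.mask_ratio = mask_ratio
        self.temperature = temperature

    def forward(self, patches, pos_embed):
        # patches:   (B, N, dim) precomputed patch embeddings
        # pos_embed: (B, N, dim) positional encodings of patch locations
        B, N, D = patches.shape
        num_mask = max(1, int(self.mask_ratio * N))
        mask_idx = torch.rand(B, N, device=patches.device).argsort(dim=1)[:, :num_mask]
        batch_idx = torch.arange(B, device=patches.device).unsqueeze(1)

        # Replace masked patches with a learnable mask token; positions are kept,
        # so the transformer must use spatial context to fill in the gaps.
        x = patches.clone()
        x[batch_idx, mask_idx] = self.mask_token.to(x.dtype)
        out = self.encoder(x + pos_embed)

        # Contrastive (InfoNCE-style) loss: each reconstructed masked position
        # should match its own original embedding rather than other masked patches.
        pred = F.normalize(out[batch_idx, mask_idx].reshape(-1, D), dim=-1)
        target = F.normalize(patches[batch_idx, mask_idx].reshape(-1, D), dim=-1)
        logits = pred @ target.t() / self.temperature
        labels = torch.arange(logits.size(0), device=logits.device)
        return F.cross_entropy(logits, labels)
```

In this toy setup, calling the module on a batch of patch embeddings and positional encodings returns a scalar loss that can be minimized directly; after pre-training, the encoder would be reused for downstream WSI-level tasks such as survival prediction or subtype classification.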