Mehrabian Hatef, Brodbeck Jens, Lyu Peipei, Vaquero Edith, Aggarwal Abhishek, Diehl Lauri
Non-Clinical Safety and Pathobiology, Gilead Sciences, Foster City, CA, USA.
Sci Rep. 2024 Sep 16;14(1):21643. doi: 10.1038/s41598-024-69244-3.
The main bottleneck in training a robust tumor segmentation algorithm for non-small cell lung cancer (NSCLC) on H&E is generating sufficient ground truth annotations. Various approaches for generating tumor labels to train a tumor segmentation model were explored. A large dataset of low-cost, low-accuracy panCK-based annotations was used to pre-train the model and to determine the minimum required size of the expensive but highly accurate pathologist annotation dataset. PanCK pre-training was compared to foundation models, and various architectures were explored for the model backbone. Study design and sample procurement for training a generalizable model that captures variations in NSCLC H&E were also studied. H&E imaging was performed on 112 samples (three centers, two scanner types, different staining and imaging protocols). An Attention U-Net architecture was trained using the large panCK-based annotation dataset (68 samples, total area 10,326 mm²) followed by fine-tuning using a small pathologist annotation dataset (80 samples, total area 246 mm²). This approach resulted in a mean intersection over union (mIoU) of 82% [77, 87]. PanCK pre-training provided better performance than foundation models and allowed a 70% reduction in pathologist annotations with no drop in performance. The study design ensured model generalizability across variations in H&E: performance was consistent across centers, scanners, and subtypes.
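For readers unfamiliar with the reported metric, the following is a minimal sketch of how a mean intersection over union can be computed for a tumor versus non-tumor segmentation. It is an illustration under stated assumptions, not the authors' evaluation code: the abstract does not say whether the mean is taken over classes, samples, or image tiles, so averaging over the two classes here is an assumption, and the function names `iou` and `mean_iou` are hypothetical.

```python
def iou(pred, truth, cls):
    """Intersection over union for one class label.

    pred and truth are equally sized 2D lists of integer labels
    (assumed here: 1 = tumor, 0 = non-tumor).
    """
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred = p == cls
            in_truth = t == cls
            inter += in_pred and in_truth
            union += in_pred or in_truth
    # None signals that the class is absent from both masks.
    return inter / union if union else None


def mean_iou(pred, truth, classes=(0, 1)):
    """Average the per-class IoU, skipping classes absent from both masks.

    Averaging over classes is an assumption; the source does not
    specify how its mIoU is aggregated.
    """
    scores = [iou(pred, truth, c) for c in classes]
    scores = [s for s in scores if s is not None]
    if not scores:
        raise ValueError("no class is present in either mask")
    return sum(scores) / len(scores)


# Toy 2x3 masks: predicted labels versus reference annotation.
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]

print(iou(pred, truth, 1))    # tumor class: 2 shared pixels / 4 in the union
print(mean_iou(pred, truth))  # mean over tumor and non-tumor classes
```

In this toy case both classes share 2 of 4 union pixels, so the per-class IoU and the mean are each 0.5. On whole-slide images the same counts would normally be accumulated with array operations over tiles instead of nested Python loops.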