Deep feature batch correction using ComBat for machine learning applications in computational pathology.

Author information

Murchan Pierre, Ó Broin Pilib, Baird Anne-Marie, Sheils Orla, Finn Stephen P

Affiliations

Department of Histopathology and Morbid Anatomy, Trinity Translational Medicine Institute, Trinity College Dublin, Dublin D08 W9RT, Ireland.

The SFI Centre for Research Training in Genomics Data Science, Dublin, Ireland.

Publication information

J Pathol Inform. 2024 Sep 12;15:100396. doi: 10.1016/j.jpi.2024.100396. eCollection 2024 Dec.

Abstract

BACKGROUND

Developing artificial intelligence (AI) models for digital pathology requires large datasets from multiple sources. However, without careful implementation, AI models risk learning confounding site-specific features in datasets instead of clinically relevant information, leading to overestimated performance, poor generalizability to real-world data, and potential misdiagnosis.

METHODS

Whole-slide images (WSIs) from The Cancer Genome Atlas (TCGA) colon adenocarcinoma (COAD) and stomach adenocarcinoma datasets were selected for inclusion in this study. Patch embeddings were obtained using three feature extraction models, followed by ComBat harmonization. Attention-based multiple instance learning models were trained to predict tissue-source site (TSS), as well as clinical and genetic attributes, using raw, Macenko-normalized, and ComBat-harmonized patch embeddings.
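
As a rough illustration of what harmonizing embeddings across sites means, the sketch below aligns each batch's per-feature mean and variance to pooled values. It is a deliberately simplified location-scale adjustment and not the study's pipeline: it omits ComBat's empirical Bayes shrinkage and covariate modelling, and the function name `location_scale_harmonize`, the synthetic site data, and the `eps` parameter are all assumptions made for this example.

```python
import numpy as np

def location_scale_harmonize(features, batch_labels, eps=1e-8):
    """Simplified batch correction (NOT full ComBat): shift and rescale each
    batch so its per-feature mean and standard deviation match pooled values."""
    features = np.asarray(features, dtype=float)
    batch_labels = np.asarray(batch_labels)
    batches = np.unique(batch_labels)

    # Target location: overall per-feature mean.
    grand_mean = features.mean(axis=0)

    # Target scale: within-batch variance pooled across batches (weighted by size).
    pooled_var = np.zeros(features.shape[1])
    for batch in batches:
        mask = batch_labels == batch
        pooled_var += mask.sum() * features[mask].var(axis=0)
    pooled_std = np.sqrt(pooled_var / len(features))

    harmonized = np.empty_like(features)
    for batch in batches:
        mask = batch_labels == batch
        batch_mean = features[mask].mean(axis=0)
        batch_std = features[mask].std(axis=0) + eps
        # Standardize within the batch, then map onto the pooled scale and location.
        harmonized[mask] = (features[mask] - batch_mean) / batch_std * pooled_std + grand_mean
    return harmonized

# Hypothetical embeddings from two sites with different offsets and spreads.
rng = np.random.default_rng(0)
site_a = rng.normal(0.0, 1.0, size=(200, 16))
site_b = rng.normal(2.0, 3.0, size=(200, 16))
embeddings = np.vstack([site_a, site_b])
sites = np.array(["A"] * 200 + ["B"] * 200)

corrected = location_scale_harmonize(embeddings, sites)
```

After this adjustment the two hypothetical sites share the same per-feature mean and spread, so a classifier can no longer separate them on those first- and second-order statistics alone. Full ComBat additionally shrinks the per-batch estimates toward a common prior, which matters when some sites contribute few samples.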

RESULTS

TSS prediction achieved high accuracy (AUROC > 0.95) with all three feature extraction models. ComBat harmonization significantly reduced the AUROC for TSS prediction, with mean AUROCs dropping to approximately 0.5 for most models, indicating successful mitigation of batch effects (e.g., CCL-ResNet50 in TCGA-COAD: pre-ComBat AUROC = 0.960, post-ComBat AUROC = 0.506, P < 0.001). Clinical attributes associated with TSS, such as race and treatment response, showed decreased predictability post-harmonization. Notably, the prediction of genetic features such as microsatellite instability (MSI) status remained robust after harmonization (e.g., MSI in TCGA-COAD: pre-ComBat AUROC = 0.667, post-ComBat AUROC = 0.669, P = 0.952), indicating the preservation of true histological signals.
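
To make the AUROC figures concrete, here is a minimal sketch of the metric for a binary task, computed as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one (ties count as half). The function name `auroc` and the toy labels and scores are assumptions for illustration; the study's own evaluation code and any multi-class averaging scheme are not described in this abstract.

```python
import numpy as np

def auroc(labels, scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels]
    neg = scores[~labels]
    # Compare every positive score with every negative score.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)

# Toy example: three of the four positive/negative pairs are ranked correctly.
toy_auc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75

# Scores carrying no information about the label give chance level.
chance_auc = auroc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5])  # 0.5
```

Read this way, a post-harmonization AUROC near 0.5 for TSS means the model ranks slides from a given site no better than chance, while the essentially unchanged AUROC for MSI means the ranking for that label was not degraded.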

CONCLUSION

ComBat harmonization of deep learning-derived histology features effectively reduces the risk of AI models learning confounding features in WSIs, ensuring more reliable performance estimates. This approach is promising for the integration of large-scale digital pathology datasets.
