Division of Biostatistics, College of Public Health, Ohio State University, Columbus, OH.
Department of Population Health, New York University, New York, NY.
JCO Clin Cancer Inform. 2023 Jun;7:e2200138. doi: 10.1200/CCI.22.00138.
Reproducible translation of transcriptomics data has been hampered by the ubiquitous presence of batch effects. Statistical methods for managing batch effects were initially developed in the setting of sample group comparison and later borrowed for other settings such as survival outcome prediction. The most notable such method is ComBat, which adjusts for batches by including them as a covariate alongside sample groups in a linear regression. In survival prediction, however, ComBat is used without definable groups for the survival outcome and is applied as a separate step before survival regression, even though the outcome may be confounded with batch. To address these issues, we propose a new method called BATch MitigAtion via stratificatioN (BatMan). It adjusts for batches by treating them as strata in survival regression and uses variable selection methods such as regularized regression to handle high dimensionality. We assess the performance of BatMan in comparison with ComBat, each used either alone or in conjunction with data normalization, in a resampling-based simulation study under various levels of predictive signal strength and patterns of batch-outcome association. Our simulations show that (1) BatMan outperforms ComBat in nearly all scenarios when there are batch effects in the data and (2) the performance of both methods can be worsened by the addition of data normalization. We further evaluate them using microRNA data for ovarian cancer from The Cancer Genome Atlas and find that BatMan outperforms ComBat, while the addition of data normalization worsens the prediction. Our study thus shows the advantage of BatMan and raises caution about the use of data normalization in the context of developing survival prediction models. The BatMan method and the simulation tool for performance assessment are implemented in R and publicly available at LXQin/PRECISION.survival-GitHub.
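To make the idea of adjusting for batches as strata concrete, the following is a minimal sketch of a batch-stratified Cox partial likelihood with an optional lasso penalty. It is not the authors' R implementation: the function names, the toy data, the Breslow handling of ties, and the specific lasso form of the penalty are all assumptions made for illustration, and Python is used only to keep the example self-contained.

```python
import math
from collections import defaultdict


def stratified_cox_nll(times, events, batches, X, beta):
    """Negative log partial likelihood of a Cox model stratified by batch.

    Risk sets are formed within each batch only, so each batch keeps its own
    unspecified baseline hazard while the coefficients beta are shared.
    Tied event times use the Breslow approximation (an assumption here).
    """
    # Linear predictor eta_i = x_i . beta for every sample.
    eta = [sum(x_j * b_j for x_j, b_j in zip(x, beta)) for x in X]

    # Group sample indices by batch label; these groups are the strata.
    strata = defaultdict(list)
    for i, b in enumerate(batches):
        strata[b].append(i)

    nll = 0.0
    for idx in strata.values():
        for i in idx:
            if not events[i]:
                continue  # censored samples enter only through risk sets
            # Risk set: samples in the SAME batch still at risk at times[i].
            risk = sum(math.exp(eta[j]) for j in idx if times[j] >= times[i])
            nll -= eta[i] - math.log(risk)
    return nll


def lasso_objective(times, events, batches, X, beta, lam):
    """Hypothetical penalized objective: mean stratified NLL plus an L1 penalty."""
    n = len(times)
    nll = stratified_cox_nll(times, events, batches, X, beta)
    return nll / n + lam * sum(abs(b) for b in beta)


# Toy data (made up): two batches, two features, one censored sample.
times = [1.0, 2.0, 3.0, 1.0, 2.0]
events = [1, 1, 1, 1, 0]
batches = ["A", "A", "A", "B", "B"]
X = [[0.5, 1.0], [0.1, -0.3], [-0.4, 0.2], [1.2, 0.0], [0.3, 0.8]]

# With beta = 0 each event contributes log(size of its within-batch risk set):
# batch A gives log 3 + log 2 + log 1, batch B gives log 2, so the sum is log 12.
print(round(stratified_cox_nll(times, events, batches, X, [0.0, 0.0]), 4))  # 2.4849
```

Because each risk set is restricted to one batch, a multiplicative shift in the baseline hazard of an entire batch cancels out of every term, which is the sense in which stratification absorbs batch effects in the outcome without estimating a batch coefficient. In practice the penalized objective would be minimized over `beta` with a dedicated solver rather than evaluated by hand.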