Stein Caleb K, Qu Pingping, Epstein Joshua, Buros Amy, Rosenthal Adam, Crowley John, Morgan Gareth, Barlogie Bart
Myeloma Institute for Research and Therapy, University of Arkansas for Medical Sciences, Little Rock, AR, USA.
Cancer Research and Biostatistics, Seattle, WA, USA.
BMC Bioinformatics. 2015 Feb 25;16:63. doi: 10.1186/s12859-015-0478-3.
Gene expression profiling (GEP) via microarray analysis is widely used in clinical settings to assess risk and support other patient diagnostics. However, non-biological factors such as systematic changes in sample preparation, differences between scanners, and other potential batch effects are often unavoidable in long-term studies and meta-analyses. To reduce the impact of batch effects on microarray data, Johnson, Rabinovic, and Li developed ComBat for use when combining batches of gene expression microarray data. We propose a modification to ComBat that shifts and scales data to the location and scale of a pre-determined 'gold-standard' batch. This modified ComBat (M-ComBat) is designed specifically for meta-analysis and batch-effect adjustment with predictive models that have been validated and fixed on historical data from a 'gold-standard' batch.
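The idea of adjusting every batch to the location and scale of a reference batch can be illustrated with a minimal sketch. This is not the authors' implementation: it omits the covariate model and the empirical Bayes shrinkage that ComBat uses, and the function name `adjust_to_reference` and the data layout (a dict of batches, each a list of samples holding per-probe values) are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def adjust_to_reference(batches, reference):
    """Shift and rescale each probe in every batch to match a reference batch.

    batches:   dict mapping batch name -> list of samples,
               where each sample is a list of probe values.
    reference: name of the 'gold-standard' batch, which is left unchanged.

    Simplified illustration: plain per-probe means and standard deviations,
    with no covariates and no empirical Bayes shrinkage.
    """
    # Per-batch, per-probe (mean, standard deviation).
    stats = {}
    for name, samples in batches.items():
        columns = list(zip(*samples))  # one tuple of values per probe
        stats[name] = [(mean(col), pstdev(col)) for col in columns]

    ref_stats = stats[reference]
    adjusted = {}
    for name, samples in batches.items():
        if name == reference:
            # The reference batch keeps its original values.
            adjusted[name] = [list(sample) for sample in samples]
            continue
        rows = []
        for sample in samples:
            row = []
            for g, value in enumerate(sample):
                mu, sd = stats[name][g]
                ref_mu, ref_sd = ref_stats[g]
                # Fall back to a pure shift when the probe is constant in the batch.
                scale = ref_sd / sd if sd > 0 else 1.0
                row.append(ref_mu + (value - mu) * scale)
            rows.append(row)
        adjusted[name] = rows
    return adjusted
```

In this sketch a fixed risk model trained on the reference batch would see adjusted values on the same per-probe location and scale as its training data, which is the property the abstract attributes to M-ComBat. Standard ComBat, by contrast, centers the adjusted data on an overall location and scale estimated from all batches together.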
We combined data from the Myeloma Institute for Research and Therapy (MIRT) across two batches ('Old' and 'New' Kit sample preparation) with external data sets from the HOVON-65/GMMG-HD4 and MRC-IX trials, first without transformation and then with both ComBat and M-ComBat transformations. Fixed and validated gene risk signatures developed at MIRT on the Old Kit standard (the GEP5, GEP70, and GEP80 risk scores) were compared across these combined data sets. Both ComBat and M-ComBat eliminated the probe-level differences caused by systematic batch effects: by ANOVA at a q-value threshold of 0.01, over 98% of untransformed probes differed significantly between batches, compared with zero probes after either ComBat or M-ComBat. The mean and distribution of risk scores, as well as the proportion of subjects identified as high-risk, agreed more closely with the 'gold-standard' batch under M-ComBat than under ComBat. Risk-score performance improved overall with either ComBat or M-ComBat; however, in our study, M-ComBat combined with the original, optimal risk cutoffs gave a greater ability to identify smaller cohorts of high-risk subjects.
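The per-probe batch comparison described above rests on a one-way ANOVA. As a rough sketch of the statistic involved (not the authors' code), the function below computes the ANOVA F statistic for a single probe across batches. The name `anova_f` is hypothetical, and the sketch stops at the F statistic: converting it to a p-value requires the F distribution (for example via `scipy.stats.f_oneway`), and the q-value adjustment across probes is not shown.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F statistic for one probe.

    groups: list of lists, one list of expression values per batch.
    Assumes at least two batches and non-zero within-batch variation.
    """
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    k = len(groups)       # number of batches
    n = len(all_values)   # total number of samples

    # Variation of batch means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of samples around their own batch mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F for a probe indicates that batch means differ more than within-batch noise would explain; after a successful batch adjustment the statistic should fall toward zero because the batch means coincide.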
M-ComBat is a practical modification to an accepted method that offers greater power to control the location and scale of batch-effect adjusted data. M-ComBat allows for historical models to function as intended on future samples despite known, often unavoidable systematic changes to gene expression data.