1 Nutritional Methodology and Biostatistics Group, International Agency for Research on Cancer (IARC), World Health Organization, 150 cours Albert Thomas, 69372 Lyon CEDEX 08, France.
2 Epigenetics Group, IARC, Lyon, France.
Clin Epigenetics. 2018 Mar 21;10:38. doi: 10.1186/s13148-018-0471-6. eCollection 2018.
Methylation measures quantified by microarray techniques can be affected by systematic variation arising from the technical processing of samples, which may compromise measurement accuracy and bias the estimate of the association under investigation. Quantifying the contribution of systematic sources of variation is challenging in datasets with hundreds of thousands of features. In this study, we apply a method previously developed for the analysis of metabolomics data to evaluate how well existing normalization techniques correct for unwanted variation.
The Illumina Infinium HumanMethylation450K array was used to measure methylation levels at over 421,000 CpG sites for 902 participants in a case-control study on breast cancer nested within the EPIC cohort. The principal component partial R-square (PC-PR2) analysis was used to identify and quantify the variability attributable to potential systematic sources of variation. Three correction techniques were applied: ComBat, surrogate variable analysis (SVA), and a linear regression model used to compute residuals. The impact of each correction method on the association between smoking status and DNA methylation levels was evaluated, and results were compared with findings from a large meta-analysis.
A sizeable proportion of systematic variability was attributable to the variables 'batch' and 'sample position' within 'chip', with partial R2 statistics equal to 9.5 and 11.4% of total variation, respectively. After application of ComBat or the residuals method, the contribution was 1.3 and 0.2%, respectively. The SVA technique reduced the variability due to 'batch' (1.3%) and 'sample position' (0.6%), and also diminished the variability attributable to 'chip' within a batch (0.9%). After the ComBat or residuals corrections, a larger number of significant sites (n = 600 and n = 427, respectively) were associated with smoking status than after the SVA correction (n = 96).
The three correction methods removed systematic variation in DNA methylation data, as assessed by the PC-PR2, which proved a useful tool for exploring variability in high-dimensional data. SVA produced more conservative findings than ComBat for the association between smoking and DNA methylation.
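To make the PC-PR2 idea and the residuals correction concrete, the following is a minimal sketch on simulated data, not the authors' implementation. It assumes that PC-PR2 proceeds by (i) running a PCA, (ii) keeping the leading components up to a variance threshold, (iii) regressing each component score on the covariates, and (iv) averaging each covariate's partial R2 across components with eigenvalue weights. The 0.8 threshold, the drop-one definition of partial R2, the function names (`pc_pr2`, `residual_correction`, `one_hot`), and the simulated 'batch' and 'smoking' variables are all illustrative assumptions.

```python
import numpy as np

def one_hot(labels):
    """Dummy-code a categorical variable, dropping the first level."""
    labels = np.asarray(labels)
    levels = np.unique(labels)
    return (labels[:, None] == levels[None, 1:]).astype(float)

def _sse(y, X):
    """Residual sum of squares from a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def pc_pr2(data, covariates, var_threshold=0.8):
    """Eigenvalue-weighted partial R2 per covariate (hypothetical helper).

    data: samples x features; covariates: dict name -> design columns.
    The threshold value is an assumption made for this sketch.
    """
    n = data.shape[0]
    centered = data - data.mean(axis=0)
    # PCA via SVD; squared singular values are proportional to eigenvalues.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    eig = s ** 2
    explained = np.cumsum(eig) / eig.sum()
    k = min(int(np.searchsorted(explained, var_threshold)) + 1, len(eig))
    scores = u[:, :k] * s[:k]
    weights = eig[:k] / eig[:k].sum()
    intercept = np.ones((n, 1))
    full = np.hstack([intercept] + list(covariates.values()))
    result = {}
    for name in covariates:
        others = [v for m, v in covariates.items() if m != name]
        reduced = np.hstack([intercept] + others)
        r2 = []
        for j in range(k):
            y = scores[:, j]
            sst = float(y @ y)  # component scores are already centered
            # Drop-one partial R2: extra variation explained by `name`.
            r2.append((_sse(y, reduced) - _sse(y, full)) / sst)
        result[name] = float(np.dot(weights, r2))
    return result

def residual_correction(data, design):
    """Regress every feature on the nuisance design and keep the residuals,
    adding the feature means back so the original scale is preserved."""
    X = np.hstack([np.ones((data.shape[0], 1)), design])
    beta, *_ = np.linalg.lstsq(X, data, rcond=None)
    return data - X @ beta + data.mean(axis=0)

# Simulated example: 120 samples, 300 features, 4 batches (all made up).
rng = np.random.default_rng(0)
n, p = 120, 300
batch = np.repeat(np.arange(4), n // 4)
smoking = rng.integers(0, 2, size=n)
batch_shift = rng.normal(0.0, 1.0, size=(4, p))
data = rng.normal(0.0, 1.0, size=(n, p)) + batch_shift[batch]
data[:, :10] += 0.8 * smoking[:, None]  # small signal of interest

covs = {"batch": one_hot(batch), "smoking": one_hot(smoking)}
before = pc_pr2(data, covs)
after = pc_pr2(residual_correction(data, covs["batch"]), covs)
print("before:", before)
print("after: ", after)
```

In this toy setting, the share of variability attributed to 'batch' should drop sharply after the residuals correction, mirroring the kind of before-and-after comparison reported above. ComBat and SVA are not reproduced here because they rely on dedicated implementations.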