Bose Maitreyee, Wu Chong, Pankow James S, Demerath Ellen W, Bressler Jan, Fornage Myriam, Grove Megan L, Mosley Thomas H, Hicks Chindo, North Kari, Kao Wen Hong, Zhang Yu, Boerwinkle Eric, Guan Weihua
Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA.
BMC Bioinformatics. 2014 Sep 19;15(1):312. doi: 10.1186/1471-2105-15-312.
DNA methylation is a widely studied epigenetic phenomenon; alterations in methylation patterns influence human phenotypes and risk of disease. As part of the Atherosclerosis Risk in Communities (ARIC) study, the Illumina Infinium HumanMethylation450 (HM450) BeadChip was used to measure DNA methylation in peripheral blood obtained from ~3000 African American study participants. Over 480,000 cytosine-guanine (CpG) dinucleotide sites were surveyed on the HM450 BeadChip. To evaluate the impact of technical variation, 265 technical replicates from 130 participants were included in the study.
For each CpG site, we calculated the intraclass correlation coefficient (ICC), which ranges between 0 and 1, to compare the variation in methylation levels within and between replicate pairs. We modeled the distribution of ICC values as a mixture of censored or truncated normal and normal distributions, fitted with an EM algorithm. The CpG sites were clustered into low- and high-reliability groups according to the calculated posterior probabilities. We also demonstrated the performance of this clustering in a study of association between methylation levels and smoking status. Of the CpG sites showing genome-wide significant association with smoking status, most (~96%) fell in the high-reliability cluster.
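As an illustration of the per-site ICC computation, the following is a minimal sketch using the one-way random-effects ANOVA estimator. The abstract does not state which estimator was used, so this choice is an assumption, as are the function name `icc_oneway`, the input layout (one list of replicate beta values per participant), and the clipping of negative estimates to 0 to keep the result in [0, 1].

```python
import math


def icc_oneway(groups):
    """One-way random-effects ICC for a single CpG site (illustrative sketch).

    groups: list of lists; each inner list holds the replicate methylation
    values of one participant. Group sizes may differ.
    """
    n_groups = len(groups)
    sizes = [len(g) for g in groups]
    n_total = sum(sizes)
    if n_groups < 2 or n_total <= n_groups:
        raise ValueError("need at least 2 participants and some replication")

    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-participant and within-participant sums of squares.
    ss_between = sum(k * (m - grand_mean) ** 2 for k, m in zip(sizes, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)

    # Effective number of replicates per participant (handles unbalanced data).
    k0 = (n_total - sum(k * k for k in sizes) / n_total) / (n_groups - 1)

    denom = ms_between + (k0 - 1.0) * ms_within
    if denom == 0.0:
        return math.nan  # no variation at all: ICC is undefined

    icc = (ms_between - ms_within) / denom
    # Assumption: negative estimates are clipped so the value lies in [0, 1].
    return min(1.0, max(0.0, icc))


# Hypothetical beta values for three participants, two replicates each.
site = [[0.10, 0.12], [0.50, 0.48], [0.90, 0.91]]
print(icc_oneway(site))
```

Replicates that agree closely within each participant, relative to the differences between participants, give an ICC near 1; a site where replicate differences dominate gives an ICC near 0.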
We suggest that CpG sites with low ICC be excluded from subsequent association analyses, or that associations at such sites be interpreted with extra caution.