Zhang Weiwei, Spector Tim D, Deloukas Panos, Bell Jordana T, Engelhardt Barbara E
Department of Molecular Genetics and Microbiology, Duke University, Durham, NC, USA.
Department of Twin Research and Genetic Epidemiology, King's College London, London, UK.
Genome Biol. 2015 Jan 24;16(1):14. doi: 10.1186/s13059-015-0581-9.
Recent assays for individual-specific genome-wide DNA methylation profiles have enabled epigenome-wide association studies to identify specific CpG sites associated with a phenotype. Computational prediction of CpG site-specific methylation levels is critical to enable genome-wide analyses, but current approaches predict only average methylation within a locus and are often limited to specific genomic regions.
We characterize genome-wide DNA methylation patterns and show that correlation among CpG sites decays rapidly, making predictions based solely on neighboring sites challenging. We built a random forest classifier to predict methylation levels at CpG site resolution using features including neighboring CpG site methylation levels and genomic distance, co-localization with coding regions, CpG islands (CGIs), and regulatory elements from the ENCODE project. Our approach achieves 92% accuracy in predicting genome-wide methylation levels at single-CpG-site resolution. The accuracy increases to 98% when restricted to CpG sites within CGIs, and the approach is robust to differences in platform and to cell-type heterogeneity. Our classifier outperforms other types of classifiers and identifies features that contribute to prediction accuracy: neighboring CpG site methylation, CGIs, co-localized DNase I hypersensitive sites, transcription factor binding sites, and histone modifications were found to be most predictive of methylation levels.
Our observations of DNA methylation patterns led us to develop a classifier to predict DNA methylation levels at CpG site resolution with high accuracy. Furthermore, our method identified genomic features that interact with DNA methylation, suggesting mechanisms involved in DNA methylation modification and regulation, and linking diverse epigenetic processes.
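To make the feature set named in the abstract concrete, the sketch below assembles one training row for a single CpG site: the methylation level and genomic distance of its nearest upstream and downstream neighbors, plus a flag for overlap with a CpG island. It is an illustration only, not the authors' pipeline. The function names, the dictionary keys, the half-open interval convention, and the 0.5 cutoff used to turn a methylation level into a binary class label are all assumptions made for this example. Rows of this kind could then be passed to any random forest implementation.

```python
from bisect import bisect_right


def binarize(beta, threshold=0.5):
    """Class label: 1 (methylated) if beta >= threshold, else 0.

    The 0.5 threshold is an assumption made for this sketch.
    """
    return 1 if beta >= threshold else 0


def in_any_interval(pos, intervals):
    """True if pos falls inside one of the half-open [start, end) intervals.

    intervals must be sorted by start and non-overlapping.
    """
    starts = [start for start, _ in intervals]
    k = bisect_right(starts, pos) - 1  # last interval starting at or before pos
    return k >= 0 and pos < intervals[k][1]


def build_row(positions, betas, i, cgi_intervals):
    """Feature row for site i on one chromosome (positions sorted ascending).

    Neighbor features are None when the site has no neighbor on that side.
    """
    row = {}
    for side, j in (("upstream", i - 1), ("downstream", i + 1)):
        has_neighbor = 0 <= j < len(positions)
        row[side + "_beta"] = betas[j] if has_neighbor else None
        row[side + "_dist"] = abs(positions[j] - positions[i]) if has_neighbor else None
    row["in_cgi"] = in_any_interval(positions[i], cgi_intervals)
    row["label"] = binarize(betas[i])
    return row


# Hypothetical toy data: four CpG sites and one CpG island.
positions = [100, 150, 900, 5000]
betas = [0.9, 0.8, 0.1, 0.6]
cgi_intervals = [(120, 1000)]

rows = [build_row(positions, betas, i, cgi_intervals) for i in range(len(positions))]
```

Keeping the neighbor distance as its own feature, next to the neighbor's methylation level, lets a classifier discount distant neighbors. This matters because the abstract reports that correlation among CpG sites decays rapidly.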