Bock Christoph, Walter Jörn, Paulsen Martina, Lengauer Thomas
Max-Planck-Institut für Informatik, Saarbrücken, Germany.
PLoS Comput Biol. 2007 Jun;3(6):e110. doi: 10.1371/journal.pcbi.0030110. Epub 2007 May 2.
CpG islands were originally identified by epigenetic and functional properties, namely, absence of DNA methylation and frequent promoter association. However, this concept was quickly replaced by simple DNA sequence criteria, which allowed for genome-wide annotation of CpG islands in the absence of large-scale epigenetic datasets. Although widely used, the current CpG island criteria incur significant disadvantages: (1) reliance on arbitrary threshold parameters that bear little biological justification, (2) failure to account for widespread heterogeneity among CpG islands, and (3) apparent lack of specificity when applied to the human genome. This study is driven by the idea that a quantitative score of "CpG island strength" that incorporates epigenetic and functional aspects can help resolve these issues. We construct an epigenome prediction pipeline that links the DNA sequence of CpG islands to their epigenetic states, including DNA methylation, histone modifications, and chromatin accessibility. By training support vector machines on epigenetic data for CpG islands on human Chromosomes 21 and 22, we identify informative DNA attributes that correlate with open versus compact chromatin structures. These DNA attributes are used to predict the epigenetic states of all CpG islands genome-wide. Combining predictions for multiple epigenetic features, we estimate the inherent CpG island strength for each CpG island in the human genome, i.e., its inherent tendency to exhibit an open and transcriptionally competent chromatin structure. We extensively validate our results on independent datasets, showing that the CpG island strength predictions are applicable and informative across different tissues and cell types, and we derive improved maps of predicted "bona fide" CpG islands. 
The mapping of CpG islands by epigenome prediction is conceptually superior to identifying CpG islands by widely used sequence criteria, since it links CpG island detection to their characteristic epigenetic and functional states. It is also superior to purely experimental epigenome mapping for CpG island detection, since it abstracts from specific properties that are limited to a single cell type or tissue. In addition, using computational epigenetics methods, we identified a high correlation between the epigenome and characteristics of the DNA sequence, a finding that emphasizes the need for a better understanding of the mechanistic links between genome and epigenome.
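To make concrete the kind of "simple DNA sequence criteria" the abstract criticizes, the following is a minimal sketch of a threshold-based CpG island check. The abstract does not name specific thresholds; the defaults below (length of at least 200 bp, GC content of at least 50%, observed/expected CpG ratio of at least 0.6) are an assumption based on the commonly cited Gardiner-Garden and Frommer criteria. The names `CpGCriteria`, `gc_content`, `cpg_obs_exp`, and `passes_sequence_criteria` are hypothetical and chosen for illustration only. This is not the authors' epigenome prediction pipeline, which trains support vector machines on epigenetic data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpGCriteria:
    """Threshold parameters (assumed defaults, not taken from the abstract)."""
    min_length: int = 200         # minimum region length in bp
    min_gc_content: float = 0.50  # minimum fraction of G + C
    min_obs_exp: float = 0.60     # minimum observed/expected CpG ratio


def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CpG * N) / (#C * #G)."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    # "CG" cannot overlap with itself, so str.count finds every CpG dinucleotide.
    cpg_count = seq.count("CG")
    return cpg_count * len(seq) / (c_count * g_count)


def passes_sequence_criteria(seq: str, criteria: CpGCriteria = CpGCriteria()) -> bool:
    """Return True if seq meets all three threshold criteria."""
    return (
        len(seq) >= criteria.min_length
        and gc_content(seq) >= criteria.min_gc_content
        and cpg_obs_exp(seq) >= criteria.min_obs_exp
    )


if __name__ == "__main__":
    candidate = "ACGT" * 50  # toy 200 bp sequence, not real genomic data
    print(gc_content(candidate), cpg_obs_exp(candidate))
    print(passes_sequence_criteria(candidate))
```

Every decision in this sketch is a hard cutoff, which illustrates the first two disadvantages listed in the abstract: the thresholds are fixed parameters, and a region either passes or fails with no graded notion of strength. Swapping in a stricter `CpGCriteria` changes which regions qualify without any reference to epigenetic state.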