Center for Systems and Computational Biology, Molecular and Cellular Oncogenesis Program, The Wistar Institute, Philadelphia, PA, USA.
BMC Bioinformatics. 2010 Jan 18;11 Suppl 1(Suppl 1):S65. doi: 10.1186/1471-2105-11-S1-S65.
The use of alternative gene promoters to drive widespread cell-type, tissue-type or developmental gene regulation is a common phenomenon in mammalian genomes. Chromatin immunoprecipitation coupled with DNA microarrays (ChIP-chip) or massively parallel sequencing (ChIP-seq) enables genome-wide identification of active promoters under different cellular conditions using antibodies against Pol-II. However, these methods produce enrichment not only near gene promoters but also inside genes and in other genomic regions, owing to the non-specificity of the antibodies used in ChIP. Further, their use is limited by high cost and a strong dependence on cell type and context.
We trained and tested several state-of-the-art ensemble and meta-classification methods to distinguish Pol-II enriched promoter sequences from Pol-II enriched non-promoter sequences, each 500 bp long. The classification models were trained and tested on a benchmark dataset using 39 feature variables based on chromatin modification signatures and various DNA sequence features. The best-performing model was applied to seven published Pol-II ChIP-seq datasets to provide genome-wide annotation of mouse gene promoters.
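To make the idea of DNA sequence features for a 500 bp window concrete, here is a minimal Python sketch. The abstract does not list the 39 features, so the two computed here (GC content and the CpG observed/expected ratio) are assumptions chosen because they are commonly used promoter descriptors; the function names are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """CpG observed/expected ratio: (#CG * length) / (#C * #G).

    Returns 0.0 when the sequence contains no C or no G.
    """
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def sequence_features(seq: str) -> dict:
    """Illustrative feature vector for one window (assumed features only)."""
    return {"gc_content": gc_content(seq), "cpg_obs_exp": cpg_obs_exp(seq)}


if __name__ == "__main__":
    # Toy 12 bp example; a real window would be 500 bp long.
    print(sequence_features("ACGTCGCGATAT"))
```

In a full pipeline, a vector like this would be concatenated with the chromatin modification signals for the same window before being passed to a classifier.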
We present a novel algorithm based on supervised learning methods to discriminate promoter-associated Pol-II enrichment from enrichment elsewhere in the genome in ChIP-chip/seq profiles. We assembled a dataset of 11,773 promoter and 46,167 non-promoter sequences, each 500 bp long, generated from RNA Pol-II ChIP-seq data of five tissues (brain, kidney, liver, lung and spleen). We evaluated the classification models to build the best predictor and found that Bagging and Random Forest based approaches give the best accuracy. We applied the algorithm to seven published ChIP-seq datasets to provide a comprehensive set of promoter annotations for both protein-coding and non-coding genes in the mouse genome. The resulting annotations contain 13,413 protein-coding and 4,747 non-coding genes with a single promoter, 9,929 protein-coding and 1,858 non-coding genes with two or more alternative promoters, and a significant number of unassigned novel promoters.
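As an illustration of the bagging idea named above, the following is a minimal, dependency-free Python sketch: several one-split decision stumps are each fitted on a bootstrap resample of the training set, and their predictions are combined by majority vote. This is not the authors' implementation. The base learner (a stump), the two synthetic features, and all function names are assumptions made for the example.

```python
import random
from collections import Counter


def majority(labels, default=0):
    """Most common label, or `default` for an empty list."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def fit_stump(X, y):
    """Fit a one-split decision stump by exhaustive search over thresholds."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(lab != l_lab for lab in left)
                      + sum(lab != r_lab for lab in right))
            if best is None or errors < best[0]:
                best = (errors, j, t, l_lab, r_lab)
    return best[1:]  # (feature index, threshold, left label, right label)


def predict_stump(stump, row):
    j, t, l_lab, r_lab = stump
    return l_lab if row[j] <= t else r_lab


def fit_bagging(X, y, n_estimators=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models


def predict_bagging(models, row):
    """Majority vote over all stumps (use an odd count to avoid ties)."""
    return majority([predict_stump(m, row) for m in models])


def make_toy_data(n_per_class=50, seed=1):
    """Synthetic stand-in data: label 1 = promoter, 0 = non-promoter."""
    rng = random.Random(seed)
    X, y = [], []
    for label, mean in ((1, 2.0), (0, -2.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)])
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    models = fit_bagging(X, y, n_estimators=11)
    hits = sum(predict_bagging(models, r) == lab for r, lab in zip(X, y))
    print("training accuracy:", hits / len(y))
```

A Random Forest follows the same bootstrap-and-vote pattern but grows full decision trees and restricts each split to a random subset of the features. Because the dataset described above is imbalanced (11,773 promoter versus 46,167 non-promoter sequences), accuracy on real data is best read alongside per-class measures such as sensitivity and specificity.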
Our new algorithm can successfully predict promoters from the genome-wide profile of Pol-II bound regions. In addition, it performs significantly better than existing promoter prediction methods and can be applied to genome-wide prediction of Pol-II promoters.