Annotation of gene promoters by integrative data-mining of ChIP-seq Pol-II enrichment data.

Author information

Center for Systems and Computational Biology, Molecular and Cellular Oncogenesis Program, The Wistar Institute, Philadelphia, PA, USA.

Publication information

BMC Bioinformatics. 2010 Jan 18;11 Suppl 1(Suppl 1):S65. doi: 10.1186/1471-2105-11-S1-S65.

DOI:10.1186/1471-2105-11-S1-S65

PMID:20122241

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3009539/

Abstract

BACKGROUND

The use of alternative gene promoters to drive widespread cell-type, tissue-type or developmental gene regulation is a common phenomenon in mammalian genomes. Chromatin immunoprecipitation coupled with DNA microarrays (ChIP-chip) or massively parallel sequencing (ChIP-seq) enables genome-wide identification of active promoters in different cellular conditions using antibodies against Pol-II. However, these methods produce enrichment not only near gene promoters but also inside genes and in other genomic regions, due to the non-specificity of the antibodies used in ChIP. Further, the use of these methods is limited by their high cost and strong dependence on cell type and context.

METHODS

We trained and tested different state-of-the-art ensemble and meta classification methods for distinguishing Pol-II enriched promoter sequences from Pol-II enriched non-promoter sequences, each of length 500 bp. The classification models were trained and tested on a benchmark dataset, using a set of 39 feature variables based on chromatin modification signatures and various DNA sequence features. The best-performing model was applied to seven published ChIP-seq Pol-II datasets to provide genome-wide annotation of mouse gene promoters.
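To make the ensemble idea concrete, the following is a minimal sketch of bagging (bootstrap aggregation) in plain Python. It is not the authors' implementation: the base learner (a one-feature decision stump), the three synthetic features, the class proportions and all function names are assumptions chosen for illustration, whereas the study used 39 chromatin and sequence features whose details are not given in the abstract.

```python
import random

def fit_stump(X, y):
    """Find the (feature, threshold, polarity) with the fewest training errors."""
    best, best_err = (0, 0.0, 1), len(y) + 1
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            # Errors when predicting class 1 for values above the threshold.
            err = sum((row[j] > t) != bool(label) for row, label in zip(X, y))
            # Polarity -1 flips every prediction, so its error is n - err.
            for polarity, e in ((1, err), (-1, len(y) - err)):
                if e < best_err:
                    best_err, best = e, (j, t, polarity)
    return best

def predict_stump(stump, row):
    j, t, polarity = stump
    above = row[j] > t
    return int(above if polarity == 1 else not above)

def fit_bagging(X, y, n_estimators, rng):
    """Train each stump on a bootstrap resample (drawn with replacement)."""
    n = len(X)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return stumps

def predict_bagging(stumps, row):
    """Majority vote over the ensemble (use an odd ensemble size to avoid ties)."""
    votes = sum(predict_stump(s, row) for s in stumps)
    return int(2 * votes > len(stumps))

def make_toy_data(n, rng):
    """Synthetic stand-in for the feature table: label 1 = 'promoter' (hypothetical)."""
    X, y = [], []
    for _ in range(n):
        label = int(rng.random() < 0.2)            # roughly 20% positives
        X.append([rng.gauss(2.0 * label, 0.5),     # informative feature
                  rng.gauss(0.5 * label, 1.0),     # weakly informative feature
                  rng.gauss(0.0, 1.0)])            # pure noise
        y.append(label)
    return X, y

rng = random.Random(0)
X_train, y_train = make_toy_data(150, rng)
X_test, y_test = make_toy_data(100, rng)
model = fit_bagging(X_train, y_train, n_estimators=15, rng=rng)
accuracy = sum(predict_bagging(model, r) == t
               for r, t in zip(X_test, y_test)) / len(y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

A Random Forest follows the same resample-and-vote pattern but grows full decision trees and also restricts each split to a random subset of features, which the stump-based sketch above does not attempt.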

RESULTS

We present a novel algorithm based on supervised learning methods to discriminate promoter-associated Pol-II enrichment from enrichment elsewhere in the genome in ChIP-chip/seq profiles. We assembled a dataset of 11,773 promoter and 46,167 non-promoter sequences, each of length 500 bp, generated from RNA Pol-II ChIP-seq data of five tissues (brain, kidney, liver, lung and spleen). We evaluated the classification models to build the best predictor and found that Bagging and Random Forest based approaches give the best accuracy. We applied the algorithm to seven published ChIP-seq datasets to provide a comprehensive set of promoter annotations for both protein-coding and non-coding genes in the mouse genome. The resulting annotations contain 13,413 protein-coding and 4,747 non-coding genes with single promoters, 9,929 protein-coding and 1,858 non-coding genes with two or more alternative promoters, and a significant number of unassigned novel promoters.
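The counts reported above can be combined into a few derived quantities that help put the results in context. The short calculation below uses only numbers stated in the abstract; the variable names are illustrative, and the "majority baseline" (the accuracy of a trivial classifier that always predicts non-promoter) is an added point of comparison, not a figure from the paper.

```python
# Benchmark dataset sizes, as stated in the abstract.
promoters = 11_773
non_promoters = 46_167

total_sequences = promoters + non_promoters
promoter_fraction = promoters / total_sequences
# Accuracy of a trivial model that labels every sequence "non-promoter".
majority_baseline = non_promoters / total_sequences

# Annotated genes by type and promoter usage, as stated in the abstract.
genes = {
    "protein_coding": {"single": 13_413, "alternative": 9_929},
    "non_coding": {"single": 4_747, "alternative": 1_858},
}

gene_totals = {kind: c["single"] + c["alternative"] for kind, c in genes.items()}
alt_fraction = {kind: c["alternative"] / gene_totals[kind]
                for kind, c in genes.items()}

print(f"sequences: {total_sequences}, promoter fraction: {promoter_fraction:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
for kind in genes:
    print(f"{kind}: {gene_totals[kind]} genes, "
          f"{alt_fraction[kind]:.1%} with alternative promoters")
```

Because roughly four in five benchmark sequences are non-promoters, accuracy alone can look high for a weak model, so class-specific measures such as sensitivity and specificity would be worth checking alongside it.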

CONCLUSION

Our new algorithm can successfully predict promoters from the genome-wide profile of Pol-II bound regions. In addition, it performs significantly better than existing promoter prediction methods and can be applied to genome-wide prediction of Pol-II promoters.
