ProSOM: core promoter prediction based on unsupervised clustering of DNA physical profiles.

Author information

Abeel Thomas, Saeys Yvan, Rouzé Pierre, Van de Peer Yves

Affiliation

Department of Plant Systems Biology, VIB, 9052 Gent, Belgium.

Publication information

Bioinformatics. 2008 Jul 1;24(13):i24-31. doi: 10.1093/bioinformatics/btn172.

DOI:10.1093/bioinformatics/btn172

PMID:18586720

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2718650/

Abstract

MOTIVATION

More and more genomes are being sequenced, and to keep up with the pace of sequencing projects, automated annotation techniques are required. One of the most challenging problems in genome annotation is the identification of the core promoter. Because the identification of the transcription initiation region is such a challenging problem, it is not yet a common practice to integrate transcription start site prediction in genome annotation projects. Nevertheless, better core promoter prediction can improve genome annotation and can be used to guide experimental work.

RESULTS

Comparing the average structural profile based on base stacking energy of transcribed, promoter and intergenic sequences demonstrates that the core promoter has unique features that cannot be found in other sequences. We show that unsupervised clustering by using self-organizing maps can clearly distinguish between the structural profiles of promoter sequences and other genomic sequences. An implementation of this promoter prediction program, called ProSOM, is available and has been compared with the state-of-the-art. We propose an objective, accurate and biologically sound validation scheme for core promoter predictors. ProSOM performs at least as well as the software currently available, but our technique is more balanced in terms of the number of predicted sites and the number of false predictions, resulting in a better all-round performance. Additional tests on the ENCODE regions of the human genome show that 98% of all predictions made by ProSOM can be associated with transcriptionally active regions, which demonstrates the high precision.
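The two ideas named above, a structural profile derived from dinucleotide stacking energies and unsupervised clustering of such profiles with a self-organizing map, can be illustrated with a small sketch. This is not the ProSOM implementation. The energy table `TOY_STACKING` is a hypothetical placeholder rather than the published base-stacking values, and the function names, the window size, the one-dimensional map and the toy sequences are all assumptions made for illustration.

```python
# Hypothetical placeholder table, NOT the published base-stacking energies:
# each dinucleotide scores -(1 + number of G/C bases) so the sketch can run.
TOY_STACKING = {a + b: -(1.0 + (a in "GC") + (b in "GC"))
                for a in "ACGT" for b in "ACGT"}


def structural_profile(seq, table, window=3):
    """Convert a DNA sequence into a smoothed dinucleotide-energy profile."""
    raw = [table[seq[i:i + 2]] for i in range(len(seq) - 1)]
    # Moving average over `window` consecutive dinucleotide values.
    return [sum(raw[i:i + window]) / window
            for i in range(len(raw) - window + 1)]


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def best_unit(units, profile):
    """Index of the map unit closest to the profile (first one on ties)."""
    return min(range(len(units)), key=lambda k: sq_dist(units[k], profile))


def train_som(profiles, n_units=4, epochs=50):
    """Train a tiny one-dimensional self-organizing map.

    Units are initialised from the data in order, and profiles are
    presented in a fixed order, so the run is fully deterministic.
    """
    units = [list(profiles[k % len(profiles)]) for k in range(n_units)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        rate = 0.5 * frac                 # learning rate decays to zero
        radius = 1 if frac > 0.5 else 0   # neighbourhood shrinks halfway
        for p in profiles:
            win = best_unit(units, p)
            lo, hi = max(0, win - radius), min(n_units, win + radius + 1)
            for k in range(lo, hi):
                # Pull the winner and its map neighbours towards the profile.
                units[k] = [w + rate * (x - w) for w, x in zip(units[k], p)]
    return units


# Toy input: two A/T-only and two G/C-only sequences (illustrative only).
at_rich = ["TATAAATATATTAAATATAT", "ATATTTAAATATAATTATAA"]
gc_rich = ["GCGCGGCCGCGGGCCGCGCC", "CCGGCGCGCCGGGCGCGGCG"]
profiles = [structural_profile(s, TOY_STACKING) for s in at_rich + gc_rich]
som = train_som(profiles)
clusters = [best_unit(som, p) for p in profiles]
```

In this toy setting the two A/T-only sequences map to one unit and the two G/C-only sequences to a different one, because no class labels are used during training and the units organize purely around the profile shapes. A realistic experiment would substitute a measured energy table, much longer windows around candidate sites, and a two-dimensional map.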

AVAILABILITY

Predictions for the human genome, the validation datasets and the program (ProSOM) are available upon request.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d2c6/2718650/6c0a3cc7f5a7/btn172f1.jpg
