Abeel Thomas, Saeys Yvan, Rouzé Pierre, Van de Peer Yves
Department of Plant Systems Biology, VIB, 9052 Gent, Belgium.
Bioinformatics. 2008 Jul 1;24(13):i24-31. doi: 10.1093/bioinformatics/btn172.
More and more genomes are being sequenced, and automated annotation techniques are required to keep pace with sequencing projects. One of the most challenging problems in genome annotation is the identification of the core promoter. Because identifying the transcription initiation region is so difficult, integrating transcription start site prediction into genome annotation projects is not yet common practice. Nevertheless, better core promoter prediction can improve genome annotation and can be used to guide experimental work.
Comparing the average structural profiles, based on base stacking energy, of transcribed, promoter and intergenic sequences demonstrates that the core promoter has unique features that cannot be found in other sequences. We show that unsupervised clustering using self-organizing maps can clearly distinguish between the structural profiles of promoter sequences and those of other genomic sequences. An implementation of this promoter prediction program, called ProSOM, is available and has been compared with state-of-the-art predictors. We propose an objective, accurate and biologically sound validation scheme for core promoter predictors. ProSOM performs at least as well as the software currently available, but our technique is more balanced in terms of the number of predicted sites and the number of false predictions, resulting in better all-round performance. Additional tests on the ENCODE regions of the human genome show that 98% of all predictions made by ProSOM can be associated with transcriptionally active regions, demonstrating its high precision.
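The two ideas in this paragraph, converting a sequence into a base stacking energy profile and clustering such profiles with a self-organizing map, can be illustrated with a minimal sketch. This is not the ProSOM implementation: the abstract does not give the energy table, the smoothing window, the map size or the training schedule, so all of these are assumptions. In particular, `toy_energy` is a placeholder that only counts G/C bases per dinucleotide and is not a table of measured stacking energies; the function names `stacking_profile`, `best_unit` and `train_som` are likewise hypothetical.

```python
import math
import random
from itertools import product


def stacking_profile(seq, energy, window=3):
    """Turn a DNA sequence into a smoothed profile of dinucleotide values.

    `energy` maps each dinucleotide (e.g. "AC") to a number. Symbols missing
    from the table (such as "N") raise KeyError in this sketch.
    """
    seq = seq.upper()
    raw = [energy[seq[i:i + 2]] for i in range(len(seq) - 1)]
    # Moving average over `window` consecutive dinucleotide steps.
    return [sum(raw[i:i + window]) / window
            for i in range(len(raw) - window + 1)]


def best_unit(units, profile):
    """Return the grid position of the unit closest to `profile`."""
    return min(units, key=lambda pos: sum((w - x) ** 2
                                          for w, x in zip(units[pos], profile)))


def train_som(profiles, rows=2, cols=2, epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Train a small rectangular self-organizing map on equal-length profiles."""
    rng = random.Random(seed)
    dim = len(profiles[0])
    lo = min(min(p) for p in profiles)
    hi = max(max(p) for p in profiles)
    # One weight vector per grid cell, initialised within the data range.
    units = {pos: [rng.uniform(lo, hi) for _ in range(dim)]
             for pos in product(range(rows), range(cols))}
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac                   # learning rate decays linearly
        sigma = max(sigma0 * frac, 0.1)   # neighbourhood width shrinks, floored
        for profile in rng.sample(profiles, len(profiles)):
            bmu = best_unit(units, profile)
            for pos, weights in units.items():
                # Gaussian neighbourhood around the best-matching unit.
                d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
                h = lr * math.exp(-d2 / (2.0 * sigma ** 2))
                for k in range(dim):
                    weights[k] += h * (profile[k] - weights[k])
    return units


# ASSUMPTION: placeholder values, NOT measured base stacking energies.
toy_energy = {a + b: -float(1 + (a in "GC") + (b in "GC"))
              for a, b in product("ACGT", repeat=2)}

sequences = ["ATATATATAT", "TTATAAATAT", "GCGCGGCGCC", "CGCGCCGGCG"]
profiles = [stacking_profile(s, toy_energy) for s in sequences]
units = train_som(profiles)
clusters = [best_unit(units, p) for p in profiles]  # map cell per sequence
```

Because training is unsupervised, no promoter labels enter `train_som`; labels would only be used afterwards, for example to check which map cells are enriched in known promoter profiles. A real application would substitute a published dinucleotide stacking energy table for `toy_energy` and use much longer sequence windows.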
Predictions for the human genome, the validation datasets and the program (ProSOM) are available upon request.