Abeel Thomas, Saeys Yvan, Bonnet Eric, Rouzé Pierre, Van de Peer Yves
Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), 9052 Gent, Belgium.
Genome Res. 2008 Feb;18(2):310-23. doi: 10.1101/gr.6991408. Epub 2007 Dec 20.
Despite many recent efforts, in silico identification of promoter regions is still in its infancy. However, the accurate identification and delineation of promoter regions is important for several reasons, such as improving genome annotation and devising experiments to study and understand transcriptional regulation. Current methods to identify the core region of promoters require large amounts of high-quality training data and often behave like black box models that output predictions that are difficult to interpret. Here, we present a novel approach for predicting promoters in whole-genome sequences by using large-scale structural properties of DNA. Our technique requires no training, is applicable to many eukaryotic genomes, and performs extremely well in comparison with the best available promoter prediction programs. Moreover, it is fast, simple in design, and has no size constraints, and the results are easily interpretable. We compared our approach with 14 current state-of-the-art implementations using human gene and transcription start site data and analyzed the ENCODE region in more detail. We also validated our method on 12 additional eukaryotic genomes, including vertebrates, invertebrates, plants, fungi, and protists.