Rombauts Stephane, Florquin Kobe, Lescot Magali, Marchal Kathleen, Rouzé Pierre, van de Peer Yves
Department of Plant Systems Biology, Flanders Interuniversity Institute for Biotechnology, Ghent University, B-9000 Gent, Belgium.
Plant Physiol. 2003 Jul;132(3):1162-76. doi: 10.1104/pp.102.017715.
The identification of promoters and their regulatory elements is one of the major challenges in bioinformatics and integrates comparative, structural, and functional genomics. Many different approaches have been developed to detect conserved motifs in a set of genes that are either coregulated or orthologous. However, although recent approaches seem promising, in general, unambiguous identification of regulatory elements is not straightforward. The delineation of promoters is even harder, owing to their complex nature, and in silico promoter prediction is still in its infancy. Here, we review the different approaches that have been developed for identifying promoters and their regulatory elements. We discuss the detection of cis-acting regulatory elements using word-counting or probabilistic methods (so-called "search by signal" methods) and the delineation of promoters by considering both sequence content and structural features ("search by content" methods). As an example of search by content, we explored in greater detail the association of promoters with CpG islands. However, due to differences in sequence content, the parameters used to detect CpG islands in humans and other vertebrates cannot be used for plants. Therefore, a preliminary attempt was made to define parameters that could possibly define CpG and CpNpG islands in Arabidopsis, by exploring the compositional landscape around the transcriptional start site. To this end, a data set of more than 5,000 gene sequences was built, including the promoter region, the 5'-untranslated region, and the first introns and coding exons. Preliminary analysis shows that promoter location based on the detection of potential CpG/CpNpG islands in the Arabidopsis genome is not straightforward. Nevertheless, because the landscape of CpG/CpNpG islands differs considerably between promoters and introns on the one hand and exons (whether coding or not) on the other, more sophisticated approaches can probably be developed for the successful detection of "putative" CpG and CpNpG islands in plants.
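To make the "search by content" idea concrete, the following is a minimal sketch of how a compositional landscape could be profiled with a sliding window. It is not the authors' method. The CpG observed/expected ratio follows the commonly used definition (number of CpG x window length) / (number of C x number of G). The CpNpG ratio is computed by analogy, which is an assumption made here for illustration, as are the function names and the default window and step sizes. No island thresholds are applied, since the abstract states that vertebrate parameters do not transfer to plants.

```python
from typing import Iterator, NamedTuple


class WindowStats(NamedTuple):
    start: int            # 0-based start of the window in the input sequence
    gc_fraction: float    # (C + G) / window length
    cpg_obs_exp: float    # observed/expected CpG ratio
    cpnpg_obs_exp: float  # analogous ratio for CpNpG (illustrative assumption)


def _obs_exp(observed: int, n_c: int, n_g: int, length: int) -> float:
    """Observed/expected ratio: observed * length / (#C * #G); 0.0 if undefined."""
    denominator = n_c * n_g
    if denominator == 0:
        return 0.0
    return observed * length / denominator


def window_stats(window_seq: str, start: int = 0) -> WindowStats:
    """Compute composition statistics for a single window."""
    s = window_seq.upper()
    n = len(s)
    n_c, n_g = s.count("C"), s.count("G")
    cpg = s.count("CG")  # CG dinucleotides cannot overlap each other
    # CpNpG sites can overlap (e.g. CCGG), so count position by position.
    cpnpg = sum(1 for i in range(n - 2) if s[i] == "C" and s[i + 2] == "G")
    gc = (n_c + n_g) / n if n else 0.0
    return WindowStats(start, gc,
                       _obs_exp(cpg, n_c, n_g, n),
                       _obs_exp(cpnpg, n_c, n_g, n))


def scan(seq: str, window: int = 200, step: int = 50) -> Iterator[WindowStats]:
    """Slide a fixed-size window along seq and yield statistics per window.

    The window and step defaults are placeholders, not values from the paper.
    """
    for start in range(0, len(seq) - window + 1, step):
        yield window_stats(seq[start:start + window], start)
```

Running `scan` over sequence aligned on the transcriptional start site, and comparing the resulting profiles for promoter, intron, and exon regions, would be one way to look for the kind of compositional differences the abstract describes. Any cutoffs for calling a "putative" island in Arabidopsis would have to be derived from such data rather than taken from vertebrate defaults.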