Zhang Shaoqiang, Xu Minli, Li Shan, Su Zhengchang
Department of Bioinformatics and Genomics, Bioinformatics Research Center, the University of North Carolina at Charlotte, Charlotte, NC 28223, USA.
Nucleic Acids Res. 2009 Jun;37(10):e72. doi: 10.1093/nar/gkp248. Epub 2009 Apr 21.
Although cis-regulatory binding sites (CRBSs) are at least as important as the coding sequences in a genome, our general understanding of them in most sequenced genomes is very limited due to the lack of efficient and accurate experimental and computational methods for their characterization, which has largely hindered our understanding of many important biological processes. In this article, we describe a novel algorithm for genome-wide de novo prediction of CRBSs with high accuracy. We designed our algorithm to circumvent three identified difficulties for CRBS prediction using comparative genomics principles based on a new method for the selection of reference genomes, a new metric for measuring the similarity of CRBSs, and a new graph clustering procedure. When operon structures are correctly predicted, our algorithm can predict 81% of known individual binding sites belonging to 94% of known cis-regulatory motifs in the Escherichia coli K12 genome, while achieving high prediction specificity. Our algorithm has also achieved similar prediction accuracy in the Bacillus subtilis genome, suggesting that it is very robust, and thus can be applied to any other sequenced prokaryotic genome. When compared with the prior state-of-the-art algorithms, our algorithm outperforms them in both prediction sensitivity and specificity.
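The two sensitivity figures quoted above are measured at different levels: individual binding sites and whole motifs. The following is a minimal sketch of how such figures could be computed, not the authors' evaluation code. It assumes that a motif counts as recovered when at least one of its known sites is predicted, which is one reading of "belonging to"; the abstract does not state the exact criterion. The function names, the `(operon, position)` site key and the toy data are all hypothetical.

```python
from typing import Dict, Set, Tuple

# Hypothetical site key: (operon identifier, start position).
Site = Tuple[str, int]


def site_level_sensitivity(known: Dict[str, Set[Site]], predicted: Set[Site]) -> float:
    """Fraction of all known individual binding sites that were predicted."""
    all_known: Set[Site] = set().union(*known.values())
    if not all_known:
        return 0.0
    return len(all_known & predicted) / len(all_known)


def motif_level_sensitivity(known: Dict[str, Set[Site]], predicted: Set[Site]) -> float:
    """Fraction of known motifs with at least one predicted site (assumed criterion)."""
    if not known:
        return 0.0
    recovered = sum(1 for sites in known.values() if sites & predicted)
    return recovered / len(known)


# Toy data for illustration only; these are not values from the paper.
toy_known = {
    "motifA": {("op1", 10), ("op2", 55)},
    "motifB": {("op3", 7)},
    "motifC": {("op4", 3)},
}
toy_predicted = {("op1", 10), ("op3", 7), ("op9", 1)}

print(site_level_sensitivity(toy_known, toy_predicted))
print(motif_level_sensitivity(toy_known, toy_predicted))
```

In the toy data, two of the four known sites and two of the three known motifs are recovered. The prediction `("op9", 1)` matches no known site, so it affects neither sensitivity figure; it would only count against specificity, which this sketch does not measure.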