Hawkins John, Grant Charles, Noble William Stafford, Bailey Timothy L
Institute for Molecular Bioscience, University of Queensland, Qld, Australia.
Bioinformatics. 2009 Jun 15;25(12):i339-47. doi: 10.1093/bioinformatics/btp201.
A variety of algorithms have been developed to predict transcription factor binding sites (TFBSs) within the genome by exploiting the evolutionary information implicit in multiple alignments of the genomes of related species. One such approach uses an extension of the standard position-specific motif model that incorporates phylogenetic information via a phylogenetic tree and a model of evolution. However, these phylogenetic motif models (PMMs) have never been rigorously benchmarked to determine whether they predict TFBSs better than simple position weight matrix scanning does.
We evaluate three PMM-based prediction algorithms, each of which uses a different treatment of gapped alignments, and we compare their prediction accuracy with that of a non-phylogenetic motif scanning approach. Surprisingly, all of these algorithms appear to be inferior to simple motif scanning when accuracy is measured using a gold standard of validated yeast TFBSs. However, the PMM scanners perform much better than simple motif scanning when we abandon the gold standard and instead count the statistically significant sites predicted, using column-shuffled 'random' motifs to measure significance. These results suggest that the common practice of measuring the accuracy of binding site predictors using collections of known sites may be dangerously misleading, since such collections may be missing 'weak' sites, which are exactly the type of sites needed to discriminate among predictors. We then extend our previous theoretical model of the statistical power of PMM-based prediction algorithms to allow for loss of binding sites during evolution, and show that it gives a more accurate upper bound on scanner accuracy. Finally, utilizing our theoretical model, we introduce a new method for predicting the number of real binding sites in a genome. The results suggest that the number of true sites for a yeast TF is in general several times greater than the number of known sites listed in the Saccharomyces cerevisiae Promoter Database (SCPD). Among the three scanning algorithms that we test, the MONKEY algorithm has the highest accuracy for predicting yeast TFBSs.
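As a rough illustration of the non-phylogenetic baseline and the column-shuffled control mentioned above, the following Python sketch scores every window of a sequence against a position weight matrix and builds a 'random' motif by permuting the matrix columns. It is a minimal sketch under stated assumptions, not the procedure used in the paper: the function names, the uniform background, the pseudocount and the toy motif are all hypothetical, and no phylogenetic (PMM) scoring is attempted.

```python
import math
import random

BASES = "ACGT"

def log_odds(pwm, background=0.25, pseudo=0.01):
    """Turn per-column base probabilities into log2-odds scores.

    Assumes a uniform background and a small pseudocount (both illustrative).
    """
    return [
        {b: math.log2((col[b] + pseudo) / (1 + 4 * pseudo) / background)
         for b in BASES}
        for col in pwm
    ]

def scan(seq, scores):
    """Return (offset, score) for every motif-width window on one strand."""
    width = len(scores)
    return [
        (i, sum(scores[j][seq[i + j]] for j in range(width)))
        for i in range(len(seq) - width + 1)
    ]

def shuffle_columns(pwm, rng):
    """Permute motif columns: each column is kept intact, only the order changes."""
    cols = list(pwm)  # copy, so the input motif is left untouched
    rng.shuffle(cols)
    return cols

def toy_column(base, high=0.85, low=0.05):
    """Hypothetical column that strongly prefers one base."""
    return {b: (high if b == base else low) for b in BASES}

# Toy motif with consensus TGAC (made up for illustration only).
motif = [toy_column(b) for b in "TGAC"]
hits = scan("AATGACAA", log_odds(motif))
best_offset, best_score = max(hits, key=lambda h: h[1])

# Scores from a shuffled motif give a crude null distribution for comparison.
null_motif = shuffle_columns(motif, random.Random(0))
null_hits = scan("AATGACAA", log_odds(null_motif))
```

Comparing how many windows exceed a score threshold under the real motif with how many do so under shuffled motifs is one simple way to estimate how many predictions are likely to be significant; the thresholds and statistics actually used by the authors are not given in the abstract.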