Research Group Systems Biology/Bioinformatics, Leibniz Institute for Natural Product Research and Infection Biology, Hans Knöll Institute, 07745 Jena, Germany.
Bioinformatics. 2011 Oct 15;27(20):2806-11. doi: 10.1093/bioinformatics/btr492. Epub 2011 Sep 4.
Prediction of transcription factor binding sites (TFBSs) is crucial for promoter modeling and network inference. Prediction quality is degraded by numerous false positives, which remain the main problem for all currently available TFBS search methods.
We propose a novel approach as an alternative to the widely used position weight matrices (PWMs) and hidden Markov models. Each motif of the input set is used as a search template to scan a query sequence. Motifs found in the query are scored according to the non-randomness of the motif's occurrence, the number of matching search motifs and the number of mismatches. The non-randomness is estimated by comparing the observed number of matching motifs with the number expected to occur by chance. The latter can be calculated from the base compositions of the motif and the query sequence. The method does not require prior alignment of the input motifs and thus avoids the uncertainties introduced by the alignment procedure. Compared with PWM-based tools, our method achieves higher precision at the same sensitivity and specificity. It also tends to outperform methods that combine pattern and PWM search. Most importantly, it significantly reduces the number of false positive predictions.
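The idea of template scanning and chance-expectation comparison described above can be illustrated with a minimal Python sketch. This is not the SiTaR implementation or its scoring function, which the abstract does not specify. The function names (`scan`, `expected_hits`, `observed_vs_expected`), the `max_mismatches` parameter, and the independent-base background model built from the query's base frequencies are all assumptions made for illustration.

```python
def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def scan(query, motifs, max_mismatches=1):
    """Use every input motif as a template and slide it along the query.

    Returns {position: [(motif, mismatches), ...]} for all windows that
    match a motif with at most max_mismatches mismatches.
    """
    hits = {}
    for motif in motifs:
        k = len(motif)
        for i in range(len(query) - k + 1):
            mm = count_mismatches(motif, query[i:i + k])
            if mm <= max_mismatches:
                hits.setdefault(i, []).append((motif, mm))
    return hits


def base_freqs(seq):
    """Base composition of an uppercase A/C/G/T sequence."""
    return {b: seq.count(b) / len(seq) for b in "ACGT"}


def expected_hits(motif, query, max_mismatches=1):
    """Expected number of chance matches of motif in query.

    Assumption (hypothetical background model): query bases are
    independent and drawn with the query's own base frequencies.
    """
    freqs = base_freqs(query)
    # dist[m] = probability that a random window has exactly m mismatches
    dist = [1.0]
    for base in motif:
        p_match = freqs[base]
        new = [0.0] * (len(dist) + 1)
        for m, prob in enumerate(dist):
            new[m] += prob * p_match
            new[m + 1] += prob * (1.0 - p_match)
        dist = new
    p_hit = sum(dist[:max_mismatches + 1])
    windows = len(query) - len(motif) + 1
    return p_hit * windows


def observed_vs_expected(motif, query, max_mismatches=1):
    """Observed hit count next to the count expected by chance."""
    observed = len(scan(query, [motif], max_mismatches))
    return observed, expected_hits(motif, query, max_mismatches)


if __name__ == "__main__":
    query = "TTGACGTCATTACGTGACGTCA"
    motifs = ["TGACGTCA", "TGACGCAA"]
    for pos, found in sorted(scan(query, motifs).items()):
        print(pos, found)
    print(observed_vs_expected("TGACGTCA", query))
```

An observed count well above the expected count suggests a non-random occurrence. How SiTaR combines this with the number of matching search motifs and the number of mismatches into a final score is not reproduced here.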
The method is implemented in a tool called SiTaR (Site Tracking and Recognition) and is available at http://sbi.hki-jena.de/sitar/index.php.
Supplementary data are available at Bioinformatics online.