Djordjevic Marko, Djordjevic Magdalena, Zdobnov Evgeny
Institute of Physiology and Biochemistry, Faculty of Biology, University of Belgrade, Belgrade, Serbia.
Institute of Physics Belgrade, University of Belgrade, Belgrade, Serbia.
Front Microbiol. 2017 Nov 22;8:2314. doi: 10.3389/fmicb.2017.02314. eCollection 2017.
Reliable identification of the targets of bacterial regulators is necessary to understand the regulation of bacterial gene expression. These targets are commonly predicted by searching for high-scoring binding sites in upstream genomic regions, which typically yields a large number of false positives. In contrast to this common approach, we propose a novel concept in which overrepresentation of the scoring distribution for the entire searched region is assessed, rather than predicting individual binding sites. We explore two implementations of this concept, based on the Kolmogorov-Smirnov (KS) and Anderson-Darling (AD) tests, both of which provide straightforward P-value estimates for predicted targets. The approach is applied to pleiotropic bacterial regulators, including σ70 (the bacterial housekeeping σ factor) target prediction, a classical bioinformatics problem characterized by low specificity. We show that the KS-based approach is both faster and more accurate, departing from the current paradigm in which AD is slower but more accurate. Moreover, the KS approach significantly increases search accuracy compared to the standard approach, while straightforwardly assigning well-established P-values to each potential target. Consequently, the new KS-based method proposed here, which assigns P-values to fixed-length upstream regions, provides a fast and accurate approach for predicting bacterial transcription targets.
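The region-level idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `score_windows` and `ks_one_sided`, the dictionary-based weight matrix, and the use of the standard large-sample approximation exp(-2 D² nm/(n+m)) for the one-sided two-sample KS P-value are all assumptions made for illustration. The sketch scores every window of an upstream region and then asks whether the resulting score distribution is shifted toward higher values relative to a background score sample.

```python
import math
from bisect import bisect_right


def score_windows(seq, pwm):
    """Score every window of len(pwm) in seq with a position weight matrix.

    pwm is a list of dicts, one per motif position, mapping base -> weight
    (a hypothetical representation chosen for this sketch).
    """
    width = len(pwm)
    return [
        sum(pwm[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    ]


def ks_one_sided(sample, background):
    """One-sided two-sample KS test for 'sample is shifted to higher scores'.

    Returns (D, p) where D = max_x [F_background(x) - F_sample(x)] and p is
    the asymptotic approximation exp(-2 * D^2 * n*m/(n+m)) (an assumption of
    this sketch; exact small-sample P-values would differ).
    """
    s, b = sorted(sample), sorted(background)
    n, m = len(s), len(b)
    d = 0.0
    # Both empirical CDFs are step functions, so it suffices to evaluate
    # the difference at every observed score.
    for x in s + b:
        f_sample = bisect_right(s, x) / n
        f_background = bisect_right(b, x) / m
        d = max(d, f_background - f_sample)
    p = math.exp(-2.0 * d * d * n * m / (n + m))
    return d, p


# Toy usage with a hypothetical 2-bp weight matrix and made-up sequences.
toy_pwm = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 2.0, "G": 0.0, "T": 0.0},
]
region_scores = score_windows("ACACACAC", toy_pwm)
background_scores = score_windows("GTGTGTGT", toy_pwm)
d_stat, p_value = ks_one_sided(region_scores, background_scores)
```

In practice the background sample would come from many upstream regions or randomized sequence, and a small P-value for a region would flag the downstream gene as a candidate target without calling any individual binding site.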