Djordjevic Marko, Sengupta Anirvan M, Shraiman Boris I
Department of Physics, Columbia University, New York, New York 10025, USA.
Genome Res. 2003 Nov;13(11):2381-90. doi: 10.1101/gr.1271603.
Identifying transcription factor binding sites within regulatory segments of genomic DNA is an important step toward understanding the regulatory circuits that control gene expression. Here, we describe a novel bioinformatics method that classifies potential binding sites explicitly on the basis of an estimate of the sequence-specific binding energy of a given transcription factor. The method also estimates the chemical potential of the factor, which defines the threshold of binding. In contrast to the widely used information-theoretic weight matrix method, the new approach correctly describes saturation in the transcription factor/DNA binding probability. This results in a significant reduction in the number of expected false positives, particularly in the ubiquitous case of low-specificity factors. In the strong binding limit, the algorithm is related to the "support vector machine" approach to pattern recognition. The new method is used to identify likely genomic binding sites for the E. coli transcription factors collected in the DPInteract database. In addition, for CRP (a global regulatory factor), the likely regulatory modality (i.e., repressor or activator) of predicted binding sites is determined.
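The saturation effect mentioned in the abstract can be illustrated with a small sketch. It assumes, beyond what the abstract itself states, that the binding energy of a site is a sum of independent per-position contributions and that the binding probability takes a Fermi-Dirac-type form, 1 / (1 + exp((E - mu) / kT)), with energies measured in units of kT. The energy matrix values, the function names, and the 0.5 cutoff are all hypothetical and chosen only for illustration; they are not taken from the paper or from DPInteract.

```python
import math

# Hypothetical per-position energy contributions (in units of k_B*T) for a
# toy 4-bp site. The values are illustrative only, not fitted to any data.
ENERGY_MATRIX = [
    {"A": 0.0, "C": 2.0, "G": 2.0, "T": 1.0},
    {"A": 1.5, "C": 0.0, "G": 2.5, "T": 2.0},
    {"A": 2.0, "C": 2.0, "G": 0.0, "T": 1.0},
    {"A": 0.0, "C": 1.0, "G": 2.0, "T": 2.5},
]


def binding_energy(site):
    """Sum the per-position energies (assumes independent positions)."""
    if len(site) != len(ENERGY_MATRIX):
        raise ValueError("site length must match the energy matrix width")
    return sum(column[base] for column, base in zip(ENERGY_MATRIX, site))


def binding_probability(energy, mu):
    """Saturating (Fermi-Dirac-type) occupancy; always between 0 and 1."""
    return 1.0 / (1.0 + math.exp(energy - mu))


def boltzmann_weight(energy, mu):
    """Non-saturating approximation; exceeds 1 whenever energy < mu."""
    return math.exp(mu - energy)


def predict_sites(sequence, mu, threshold=0.5):
    """Slide a window along the sequence and keep sites with p >= threshold.

    With threshold=0.5 this is equivalent to keeping sites with energy <= mu,
    so the chemical potential acts as the binding threshold.
    """
    width = len(ENERGY_MATRIX)
    hits = []
    for start in range(len(sequence) - width + 1):
        site = sequence[start:start + width]
        p = binding_probability(binding_energy(site), mu)
        if p >= threshold:
            hits.append((start, site, p))
    return hits
```

For a strong site whose energy lies well below the chemical potential, `binding_probability` levels off near 1, whereas `boltzmann_weight` keeps growing. This difference is one way to picture why a non-saturating score can misrank sites for low-specificity factors.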