Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
Genome Res. 2010 Apr;20(4):526-36. doi: 10.1101/gr.096305.109. Epub 2010 Mar 10.
The binding preferences of many transcription factors are known and are characterized by sequence binding motifs. However, determining the regions of the genome in which a transcription factor binds based on its motif is a challenging problem, particularly in species with large genomes, because many sequences contain matches to the motif but are not bound. Several rules based on sequence conservation or on location relative to a transcription start site have been proposed to help differentiate true binding sites from random ones. Other evidence sources may also be informative for this task. We developed a method for integrating multiple evidence sources using logistic regression classifiers. Our method works in two steps. First, we infer a score quantifying the general binding preference of transcription factors at all locations based on a large set of evidence features, without using any motif-specific information. Then, we combine this general binding preference score with motif information for specific transcription factors to improve prediction of the regions bound by each factor. Using cross-validation and new experimental data, we show that, surprisingly, the general binding preference can be highly predictive of true locations of transcription factor binding even when no binding motif is used. When combined with motif information, our method outperforms previous methods for predicting locations of true binding.
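The two-step idea can be illustrated with a minimal sketch. This is not the published implementation. The sketch assumes synthetic data, two stand-in evidence features (a conservation score and a proximity to the transcription start site), a hypothetical "bound by any factor" label for the first step, and a second logistic regression as the way of combining the general score with a motif score. All function and variable names are illustrative.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, x):
    # w[0] is the bias term; w[1:] are the feature weights.
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_logistic(X, y, lr=0.5, epochs=500):
    # Batch gradient descent on the mean log loss, starting from zero weights.
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def log_loss(w, X, y):
    # Mean negative log-likelihood, with probabilities clipped away from 0 and 1.
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict_proba(w, xi), eps), 1.0 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(X)


# Synthetic regions (an assumption, not data from the paper).
rng = random.Random(0)
regions = []
for _ in range(200):
    conservation = rng.random()    # stand-in evidence feature in [0, 1]
    tss_proximity = rng.random()   # stand-in evidence feature in [0, 1]
    motif_score = rng.random()     # stand-in motif match strength in [0, 1]
    p_general = sigmoid(4.0 * (conservation + tss_proximity - 1.0))
    bound_by_any = 1 if rng.random() < p_general else 0
    # A region is bound by the specific factor only if it is generally
    # bindable and, with probability equal to the motif score, matches.
    bound_by_factor = 1 if bound_by_any and rng.random() < motif_score else 0
    regions.append((conservation, tss_proximity, motif_score,
                    bound_by_any, bound_by_factor))

# Step 1: general binding preference from evidence features only (no motif).
evidence = [[r[0], r[1]] for r in regions]
any_bound = [r[3] for r in regions]
general_w = fit_logistic(evidence, any_bound)
general_scores = [predict_proba(general_w, x) for x in evidence]

# Step 2: combine the general score with the factor's motif score.
combined_inputs = [[g, r[2]] for g, r in zip(general_scores, regions)]
factor_bound = [r[4] for r in regions]
combined_w = fit_logistic(combined_inputs, factor_bound)
combined_scores = [predict_proba(combined_w, x) for x in combined_inputs]
```

Keeping the first model free of motif input means its score can be computed once per region and reused for every transcription factor, with only the second, small model fitted per factor.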