Institute of Systems and Synthetic Biology, CNRS, University of Evry, Genopole, 91030 Evry, France.
Nucleic Acids Res. 2013 Feb 1;41(3):1406-15. doi: 10.1093/nar/gks1286. Epub 2012 Dec 14.
Conventional approaches to predicting transcriptional regulatory interactions usually rely on defining a sequence motif shared by the target genes of a transcription factor (TF). These efforts have been frustrated by the limited availability and accuracy of TF binding site motifs, usually represented as position-specific scoring matrices, which may match large numbers of sites and produce an unreliable list of target genes. To improve the prediction of binding sites, we propose to additionally use an independent source of information: the genome layout. Indeed, it has been shown that co-regulated genes tend to be either neighbors or periodically spaced along the whole chromosome. This study demonstrates that the relative positioning of genes carries significant information. This novel type of information is combined with traditional sequence information by a machine learning algorithm called PreCisIon. To optimize this combination, PreCisIon builds a strong gene target classifier by adaptively combining weak classifiers based on either local binding sequence or global gene position. Because this strategy is generic, it paves the way for the optimized incorporation of any future advances in gene target prediction, whether based on local sequence, genome layout or novel criteria. With the current state of the art, PreCisIon consistently improves on methods based on sequence information only. This is shown by a cross-validation analysis of the 20 major TFs from two phylogenetically remote model organisms. For Bacillus subtilis and Escherichia coli, respectively, PreCisIon achieves on average an area under the receiver operating characteristic curve of 70% and 60%, a sensitivity of 80% and 70% and a specificity of 60% and 56%. The newly predicted gene targets are shown to be functionally consistent with previously known targets, as assessed by analysis of Gene Ontology enrichment or of the relevant literature and databases.
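The idea of adaptively combining weak classifiers built on either sequence or position can be illustrated with a small sketch. The abstract does not specify PreCisIon's weak learners or weighting scheme, so the code below swaps in plain discrete AdaBoost with one-feature threshold stumps as a generic stand-in. The two features (`seq_score`, a hypothetical motif match score, and `pos_score`, a hypothetical genome-layout score), the synthetic data and all function names are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def fit_stump(x, y, w):
    """Return (weighted error, threshold, polarity) of the best stump on one feature."""
    best = (np.inf, 0.0, 1)
    for thr in np.unique(x):
        for pol in (1, -1):
            pred = np.where(pol * (x - thr) >= 0, 1, -1)
            err = w[pred != y].sum()
            if err < best[0]:
                best = (err, thr, pol)
    return best

def adaboost(X, y, rounds=10):
    """Discrete AdaBoost: each round picks the best stump over all features."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)              # uniform sample weights to start
    model = []
    for _ in range(rounds):
        cands = [(fit_stump(X[:, j], y, w), j) for j in range(d)]
        (err, thr, pol), j = min(cands, key=lambda c: c[0][0])
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0) and division by zero
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this weak classifier
        pred = np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
        w *= np.exp(-alpha * y * pred)          # up-weight misclassified genes
        w /= w.sum()
        model.append((alpha, j, thr, pol))
    return model

def decision_score(model, X):
    """Weighted vote of all stumps; the sign gives the predicted class."""
    return sum(a * np.where(p * (X[:, j] - t) >= 0, 1, -1) for a, j, t, p in model)

# Synthetic example (assumed data): +1 = target gene, -1 = non-target.
rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=200)
seq_score = 0.8 * y + rng.normal(size=200)   # hypothetical sequence-based feature
pos_score = 0.8 * y + rng.normal(size=200)   # hypothetical position-based feature
X = np.column_stack([seq_score, pos_score])

model = adaboost(X, y, rounds=10)
scores = decision_score(model, X)
acc = np.mean(np.sign(scores) == y)
print(f"training accuracy of the combined classifier: {acc:.2f}")
print("features used per round:", [j for _, j, _, _ in model])
```

Because each round may choose a stump on either feature, the final vote can mix sequence-based and position-based evidence, which is the property the abstract attributes to PreCisIon. Accuracy here is measured on the training data only; a faithful evaluation would use cross-validation, as the study does.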