Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain.
PLoS Comput Biol. 2010 Jul 22;6(7):e1000862. doi: 10.1371/journal.pcbi.1000862.
Transcriptional regulators recognize specific DNA sequences. Because these sequences are embedded in the background of genomic DNA, it is hard to identify the key cis-regulatory elements that determine disparate patterns of gene expression. Detecting the intra- and inter-species differences among these sequences is crucial for understanding the molecular basis of both differential gene expression and evolution. Here, we address this problem by investigating the target promoters controlled by the DNA-binding PhoP protein, which governs virulence and Mg(2+) homeostasis in several bacterial species. PhoP is particularly interesting: it is highly conserved in different gamma/enterobacteria, regulating not only ancestral genes but also dozens of horizontally acquired genes that differ from species to species. Our approach consists of decomposing the DNA binding site sequences for a given regulator into families of motifs (termed submotifs) using a machine learning method inspired by the "Divide & Conquer" strategy. Partitioning a motif into sub-patterns produced computational advantages for classification, resulting in the discovery of new members of a regulon and alleviating the problem of distinguishing functional sites in chromatin immunoprecipitation and DNA microarray genome-wide analyses. Moreover, we found that certain partitions were useful in revealing biological properties of binding site sequences, including modular gains and losses of PhoP binding sites through evolutionary turnover events, as well as conservation in distant species. The high conservation of PhoP submotifs within gamma/enterobacteria, and of the regulatory protein that recognizes them, suggests that, in contrast to what was previously proposed for other regulators, the binding sites are not the major cause of divergence between related species. Instead, the divergence may be attributed to the fast evolution of orthologous target genes and/or the promoter architectures resulting from the interaction of those binding sites with the RNA polymerase.
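The idea of splitting one motif into submotifs and then classifying candidate sites against each submotif can be illustrated with a small sketch. The abstract does not specify the machine learning method, so the greedy Hamming-distance grouping below is only a stand-in, not the authors' algorithm. The sequences are invented toy data rather than real PhoP binding sites, and the names `partition_sites`, `position_frequencies`, and `best_submotif` are hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def partition_sites(sites, max_dist):
    """Greedy partition (assumed stand-in for the paper's method):
    a site joins the first group whose seed is within max_dist mismatches."""
    groups = []
    for s in sites:
        for g in groups:
            if hamming(s, g[0]) <= max_dist:
                g.append(s)
                break
        else:
            groups.append([s])  # no close group found: start a new submotif
    return groups

def position_frequencies(group, alphabet="ACGT", pseudocount=1.0):
    """Per-position base frequencies for one submotif, with pseudocounts."""
    total = len(group) + pseudocount * len(alphabet)
    matrix = []
    for i in range(len(group[0])):
        counts = Counter(seq[i] for seq in group)
        matrix.append({b: (counts[b] + pseudocount) / total for b in alphabet})
    return matrix

def score(seq, matrix):
    """Product of per-position frequencies of seq under one submotif model."""
    p = 1.0
    for base, column in zip(seq, matrix):
        p *= column[base]
    return p

def best_submotif(seq, matrices):
    """Index of the submotif model that scores seq highest."""
    scores = [score(seq, m) for m in matrices]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy, invented sequences (NOT real PhoP binding sites).
sites = ["TGTTTA", "TGTTTA", "TGTTAA", "GGATCA", "GGATCC", "TGTCTA"]
groups = partition_sites(sites, max_dist=2)
matrices = [position_frequencies(g) for g in groups]

for i, g in enumerate(groups):
    print(f"submotif {i}: {g}")
print("TGTTTA is assigned to submotif", best_submotif("TGTTTA", matrices))
```

Scoring a candidate against several narrow models instead of one pooled model is what lets a dissimilar but genuine family of sites avoid being diluted by the majority pattern. A real analysis would need a principled clustering criterion and background-corrected scores.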