Ray Pradipta, Shringarpure Suyash, Kolar Mladen, Xing Eric P
School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America.
PLoS Comput Biol. 2008 Jun 6;4(6):e1000090. doi: 10.1371/journal.pcbi.1000090.
Functional turnover of transcription factor binding sites (TFBSs), such as whole-motif loss or gain, is a common event during genome evolution. Conventional probabilistic phylogenetic shadowing methods model genome evolution only at the nucleotide level and cannot capture the evolutionary dynamics of functional turnover of aligned sequence entities. As a result, comparative genomic search for non-conserved motifs across evolutionarily related taxa remains difficult, especially in higher eukaryotes, where the cis-regulatory regions containing motifs can be long and divergent. Existing methods rely heavily on specialized pattern-driven heuristic search or sampling algorithms, which can be difficult to generalize and hard to interpret in terms of phylogenetic principles. We propose a new method, Conditional Shadowing via Multi-resolution Evolutionary Trees (CSMET), which uses a context-dependent probabilistic graphical model in which aligned sites from different taxa in a multiple alignment are modeled by either a background phylogeny or an appropriate motif phylogeny, conditional on the functional specification of each taxon. The functional specifications are themselves the output of a phylogeny that models the evolution not of individual nucleotides but of the overall functionality (e.g., functional retention or loss) of the aligned sequence segments over lineages. Combined with a hidden Markov model that autocorrelates evolutionary rates at successive sites in the genome, CSMET offers a principled way to account for lineage-specific evolution of TFBSs during motif detection, and yields a readily computable analytical form of the posterior distribution of motifs under TFBS turnover. On both simulated and real Drosophila cis-regulatory modules, CSMET outperforms other state-of-the-art comparative genomic motif finders.
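The two-level idea in the abstract (a functional phylogeny that decides, per taxon, whether a site is drawn from a motif phylogeny or a background phylogeny) can be illustrated with a toy generative sketch. This is not the CSMET implementation: the tree, taxon labels, rates, and all function names below are hypothetical; an F81-style substitution process is assumed at both levels; only a single alignment column is sampled; and the hidden Markov model over successive sites is omitted.

```python
import math
import random

# Hypothetical toy tree: child -> (parent, branch length). Labels are illustrative only.
TREE = {
    "anc": ("root", 0.1),
    "mel": ("anc", 0.05),
    "sim": ("anc", 0.05),
    "pse": ("root", 0.4),
}
ORDER = ["anc", "mel", "sim", "pse"]  # every parent appears before its children
LEAVES = ["mel", "sim", "pse"]


def f81_step(state, t, rate, pi, rng):
    """F81-style step (assumed model): keep the state with probability
    exp(-rate * t), otherwise redraw it from the equilibrium distribution pi."""
    if rng.random() < math.exp(-rate * t):
        return state
    return rng.choices(range(len(pi)), weights=pi)[0]


def evolve(pi, rate, rng):
    """Sample a state at every node by evolving from the root down the tree."""
    states = {"root": rng.choices(range(len(pi)), weights=pi)[0]}
    for node in ORDER:
        parent, t = TREE[node]
        states[node] = f81_step(states[parent], t, rate, pi, rng)
    return states


def sample_column(motif_pi, bg_pi, func_pi, rng,
                  func_rate=1.0, motif_rate=0.5, bg_rate=1.0):
    """Sample one aligned column under the two-level sketch.

    func_pi is over (0 = non-functional, 1 = functional); motif_pi and bg_pi
    are over A, C, G, T. Evolving both nucleotide processes on the full tree
    and selecting per leaf gives functional leaves the motif-tree marginal
    and the remaining leaves the background-tree marginal.
    """
    func = evolve(func_pi, func_rate, rng)     # coarse level: functionality per taxon
    motif = evolve(motif_pi, motif_rate, rng)  # fine level: motif phylogeny
    bg = evolve(bg_pi, bg_rate, rng)           # fine level: background phylogeny
    column = {}
    for leaf in LEAVES:
        idx = motif[leaf] if func[leaf] == 1 else bg[leaf]
        column[leaf] = "ACGT"[idx]
    return {leaf: func[leaf] for leaf in LEAVES}, column


if __name__ == "__main__":
    rng = random.Random(0)
    functional, column = sample_column(
        motif_pi=[0.85, 0.05, 0.05, 0.05],  # hypothetical motif position favoring A
        bg_pi=[0.25, 0.25, 0.25, 0.25],
        func_pi=[0.3, 0.7],
        rng=rng,
    )
    print(functional, column)
```

In this sketch a taxon whose functional state has switched to 0 contributes a background-like nucleotide even when its relatives retain the motif, which is the kind of lineage-specific turnover that a nucleotide-only shadowing model has no way to represent.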