Wang Ting, Stormo Gary D
Department of Genetics, Washington University Medical School, St. Louis, MO 63110, USA.
Bioinformatics. 2003 Dec 12;19(18):2369-80. doi: 10.1093/bioinformatics/btg329.
Discovery of regulatory motifs in unaligned DNA sequences remains a fundamental problem in computational biology. Two categories of algorithms have been developed to identify common motifs from a set of DNA sequences. The first can be called a 'multiple genes, single species' approach. It proposes that a degenerate motif is embedded in some or all of the otherwise unrelated input sequences and tries to describe a consensus motif and identify its occurrences. It is often used for co-regulated genes identified through experimental approaches. The second approach can be called 'single gene, multiple species'. It requires orthologous input sequences and tries to identify unusually well conserved regions by phylogenetic footprinting. Both approaches perform well, but each has some limitations. It is tempting to combine the knowledge of co-regulation among different genes and conservation among orthologous genes to improve our ability to identify motifs.
Based on the Consensus algorithm previously established by our group, we introduce a new algorithm called PhyloCon (Phylogenetic Consensus) that takes into account both conservation among orthologous genes and co-regulation of genes within a species. This algorithm first aligns conserved regions of orthologous sequences into multiple sequence alignments, or profiles, then compares profiles representing non-orthologous sequences. Motifs emerge as common regions in these profiles. Here we present a novel statistic to compare profiles of DNA sequences and a greedy approach to search for common subprofiles. We demonstrate that PhyloCon performs well on both synthetic and biological data.
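To make the profile-comparison step concrete, the following is a minimal Python sketch, not the PhyloCon implementation. The abstract does not give the formula for the statistic, so the sketch assumes an average log-likelihood-ratio score between two profile columns. It also replaces the greedy multi-profile search with a simpler exhaustive scan for the best ungapped common segment between two profiles. The uniform background, the pseudocount, and all function names are illustrative assumptions.

```python
import math

BASES = "ACGT"
BACKGROUND = {b: 0.25 for b in BASES}  # assumption: uniform base composition
PSEUDOCOUNT = 0.5                      # assumption: avoids log(0) for unseen bases


def build_profile(aligned_seqs):
    """Turn equal-length aligned sequences into per-column base counts."""
    length = len(aligned_seqs[0])
    return [
        {b: sum(1 for s in aligned_seqs if s[i] == b) for b in BASES}
        for i in range(length)
    ]


def frequencies(counts):
    """Pseudocount-smoothed base frequencies for one profile column."""
    total = sum(counts.values()) + PSEUDOCOUNT * len(BASES)
    return {b: (counts[b] + PSEUDOCOUNT) / total for b in BASES}


def column_score(c1, c2):
    """Assumed statistic: each column's counts are scored against the
    other column's frequencies, then averaged over all observed bases."""
    f1, f2 = frequencies(c1), frequencies(c2)
    numerator = sum(
        c1[b] * math.log(f2[b] / BACKGROUND[b])
        + c2[b] * math.log(f1[b] / BACKGROUND[b])
        for b in BASES
    )
    denominator = sum(c1.values()) + sum(c2.values())
    return numerator / denominator if denominator else 0.0


def best_common_subprofile(p1, p2):
    """Scan every ungapped offset and keep the highest-scoring contiguous
    run of columns. Returns (score, start_in_p1, start_in_p2, length)."""
    best = (0.0, 0, 0, 0)
    for offset in range(-(len(p2) - 1), len(p1)):
        run, run_start = 0.0, 0
        for i in range(max(0, offset), min(len(p1), offset + len(p2))):
            s = column_score(p1[i], p2[i - offset])
            if run <= 0:
                run, run_start = s, i  # restart the segment at this column
            else:
                run += s
            if run > best[0]:
                best = (run, run_start, run_start - offset, i - run_start + 1)
    return best
```

In this sketch, two profiles built from different (non-orthologous) genes score positively only where their columns share the same conserved bases, so a shared motif appears as a high-scoring segment while unrelated flanking columns score negatively and terminate it.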
Software available upon request from the authors. http://ural.wustl.edu/softwares.html