Department of Computer Science, Duke University, Box 90129, Durham, NC 27708, USA.
Nucleic Acids Res. 2010 Apr;38(6):e90. doi: 10.1093/nar/gkp1166. Epub 2010 Jan 4.
As an increasing number of eukaryotic genomes are being sequenced, comparative studies aimed at detecting regulatory elements in intergenic sequences are becoming more prevalent. Most comparative methods for transcription factor (TF) binding site discovery make use of global or local alignments of orthologous regulatory regions to assess whether a particular DNA site is conserved across related organisms, and thus more likely to be functional. Since binding sites are usually short, sometimes degenerate, and often independent of orientation, alignment algorithms may not align them correctly. Here, we present a novel, alignment-free approach for using conservation information for TF binding site discovery. We relax the definition of conserved sites: we consider a DNA site within a regulatory region to be conserved in an orthologous sequence if it occurs anywhere in that sequence, irrespective of orientation. We use this definition to derive informative priors over DNA sequence positions, and incorporate these priors into a Gibbs sampling algorithm for motif discovery. Our approach is simple and fast. It requires neither sequence alignments nor the phylogenetic relationships between the orthologous sequences, yet it is more effective on real biological data than methods that do.
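The relaxed conservation definition above can be illustrated with a minimal sketch. The abstract does not give the formula used to turn conservation into a prior, so the weighting below (a pseudocount plus the fraction of orthologous sequences that contain the site in either orientation) is an assumption made for illustration only, not the authors' method. The names `is_conserved`, `positional_prior`, `width`, and `pseudocount` are likewise hypothetical.

```python
from typing import List

# Translation table for complementing DNA bases (uppercase A, C, G, T only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(site: str) -> str:
    """Return the reverse complement of a DNA string."""
    return site.translate(_COMPLEMENT)[::-1]


def is_conserved(site: str, ortholog: str) -> bool:
    """Relaxed conservation: the site occurs anywhere in the orthologous
    sequence, on either strand. No alignment is involved."""
    return site in ortholog or reverse_complement(site) in ortholog


def positional_prior(reference: str, orthologs: List[str], width: int,
                     pseudocount: float = 0.1) -> List[float]:
    """Assumed weighting scheme (not taken from the paper): each start
    position gets weight pseudocount + fraction of orthologs in which the
    width-long site is conserved; weights are normalized to sum to 1."""
    n_starts = len(reference) - width + 1
    if n_starts <= 0:
        raise ValueError("reference is shorter than the motif width")
    weights = []
    for start in range(n_starts):
        site = reference[start:start + width]
        if orthologs:
            hits = sum(is_conserved(site, o) for o in orthologs)
            fraction = hits / len(orthologs)
        else:
            fraction = 0.0  # no orthologs: prior falls back to uniform
        weights.append(pseudocount + fraction)
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    prior = positional_prior("ACGTTT", ["GGACGG", "TTACGA"], width=3)
    for start, p in enumerate(prior):
        print(start, round(p, 3))
```

In a Gibbs sampler for motif discovery, a prior of this kind would multiply the likelihood term when sampling a motif start position in each sequence, so that positions whose sites recur in orthologous sequences are favored without any alignment or phylogeny being required.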