Narasimhan Chandrasegaran, LoCascio Philip, Uberbacher Edward
Life Sciences Division, Oak Ridge National Laboratory, PO Box 3480, Oak Ridge, TN 37830, USA.
Bioinformatics. 2003 Oct 12;19(15):1952-63. doi: 10.1093/bioinformatics/btg266.
Experimental methods capable of generating sets of co-regulated genes have become commonplace; however, recognizing the regulatory motifs responsible for this regulation remains difficult. As a result, computational detection of transcription factor binding sites in such data sets has been an active area of research. Most approaches have used either Gibbs sampling or greedy strategies to identify such elements in sets of sequences. These existing methods have varying degrees of success depending on the strength and length of the signals and the number of available sequences. We present a new deterministic iterative algorithm for regulatory element detection based on a Markov chain background. As in other methods, sequences in the entire genome and the training set are taken into account in order to discriminate against commonly occurring signals and produce patterns that are significant in the training set.
The results of the algorithm compare favorably with existing tools on previously known and newly compiled data sets. The iteration-based search appears rather rigorous, not only finding the binding sites but also showing how the binding site stands out from the genomic background. The approach used to score the results is critical, and a discussion of various scoring schemes and options is also presented. Benchmarking of several methods shows that while most tools are good at detecting strong signals, Gibbs sampling algorithms give inconsistent results when the regulatory element signal becomes weak. A Markov chain-based background model alleviates the drawbacks of MAP (maximum a posteriori log likelihood) scores.
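To illustrate what scoring a candidate site against a Markov chain background can look like, here is a minimal Python sketch. It is not the authors' implementation: the function names, the first-order chain, the pseudocount, and the position weight matrix representation of the motif are all assumptions chosen for illustration, and the deterministic iterative search itself is not reproduced because the abstract does not specify it.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_background(sequences, order=1, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from background sequences.

    Hypothetical helper: returns a function prob(context, base).
    """
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0.0))
    for seq in sequences:
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip windows containing ambiguous characters such as N.
            if base in ALPHABET and all(c in ALPHABET for c in context):
                counts[context][base] += 1

    def prob(context, base):
        # Unseen contexts fall back to a uniform distribution via the pseudocount.
        row = counts.get(context, dict.fromkeys(ALPHABET, 0.0))
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return (row[base] + pseudocount) / total

    return prob


def background_log_prob(site, context, prob, order=1):
    """Log-probability of `site` under the background chain, given preceding bases."""
    history = context[-order:] if order else ""
    total = 0.0
    for base in site:
        total += math.log(prob(history, base))
        history = (history + base)[-order:] if order else ""
    return total


def motif_log_prob(site, pwm):
    """Log-probability of `site` under a motif model (one base distribution per position)."""
    return sum(math.log(pwm[i][base]) for i, base in enumerate(site))


def log_odds(site, context, pwm, prob, order=1):
    """Motif-versus-background log-odds score; higher means the site stands out more."""
    return motif_log_prob(site, pwm) - background_log_prob(site, context, prob, order)


# Toy usage with made-up sequences and a made-up motif for the pattern TGA.
background = ["AAAAAAAAAA", "ACGTACGTAC"]
prob = train_background(background, order=1)
pwm = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]
print(log_odds("TGA", "A", pwm, prob), log_odds("AAA", "A", pwm, prob))
```

In this toy setup the background is rich in runs of A, so a site such as AAA is penalized because the background already explains it well, while TGA is rewarded for being both motif-like and rare in the background. That is the sense in which a context-dependent background can discriminate against commonly occurring signals.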
Available on request from the authors.
Data and the results presented in this paper are available on the web at http://compbio.ornl.gov/mira/index.html