Carlson Jonathan M, Chakravarty Arijit, Gross Robert H
Department of Biology, Dartmouth College, Hanover, NH 03755, USA.
J Comput Biol. 2006 Apr;13(3):686-701. doi: 10.1089/cmb.2006.13.686.
The identification of potential protein binding sites (cis-regulatory elements) in the upstream regions of genes is key to understanding the mechanisms that regulate gene expression. To this end, we present a simple, efficient algorithm, BEAM (beam-search enumerative algorithm for motif finding), aimed at the discovery of cis-regulatory elements in the DNA sequences upstream of a related group of genes. This algorithm dramatically limits the search space of expanded sequences, converting the problem from one that is exponential in the length of motifs sought to one that is linear. Unlike sampling algorithms, our algorithm converges and is capable of finding statistically overrepresented motifs with a low failure rate. Further, our algorithm is not dependent on the objective function or the organism used. Limiting the space of candidate motifs enables the algorithm to focus only on those motifs that are most likely to be biologically relevant and enables the algorithm to use direct evaluations of background frequencies instead of resorting to probabilistic estimates. In addition, limiting the space of candidate motifs makes it possible to use computationally expensive objective functions that are able to correctly identify biologically relevant motifs.
识别基因上游区域潜在的蛋白质结合位点(顺式调控元件)是理解基因表达调控机制的关键。为此,我们提出了一种简单、高效的算法——BEAM(用于基序查找的束搜索枚举算法),旨在发现一组相关基因上游DNA序列中的顺式调控元件。该算法极大地限制了扩展序列的搜索空间,将问题从一个与所寻找基序长度呈指数关系的问题转变为线性问题。与抽样算法不同,我们的算法能够收敛,并且能够以低失败率找到统计学上过度出现的基序。此外,我们的算法不依赖于目标函数或所使用的生物体。限制候选基序的空间使算法能够仅关注那些最有可能具有生物学相关性的基序,并使算法能够使用背景频率的直接评估,而不是依赖概率估计。此外,限制候选基序的空间使得使用能够正确识别生物学相关基序的计算成本高昂的目标函数成为可能。