Department of Biology, Boston College, Chestnut Hill, MA 02467, USA.
BMC Bioinformatics. 2012 Feb 14;13:32. doi: 10.1186/1471-2105-13-32.
It has been increasingly appreciated that coding sequences harbor regulatory sequence motifs in addition to encoding protein. These sequence motifs are expected to be overrepresented in nucleotide sequences bound by a common protein or small RNA. However, detecting overrepresented motifs has been difficult because of interference by constraints at the protein level. Sampling-based approaches to this problem, which rely on codon shuffling, have been limited by their exploration of only an infinitesimal fraction of the sequence space and by their use of parametric approximations.
We present a novel O(N (log N)^2)-time algorithm, CodingMotif, to identify nucleotide-level motifs of unusual copy number in protein-coding regions. Using a new dynamic programming algorithm, we exhaustively calculate the distribution of the number of occurrences of a motif over all possible coding sequences that encode the same amino acid sequence, given a background model for codon usage and dinucleotide biases. Our method takes advantage of the sparseness of loci where a given motif can occur, greatly speeding up the required convolution calculations. Knowledge of the distribution allows one to compute an exact non-parametric p-value for whether a given motif is over- or under-represented. We demonstrate that our method identifies known functional motifs more accurately than sampling-based and parametric approaches in a variety of coding datasets of various sizes, including ChIP-seq data for the transcription factors NRSF and GABP.
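To make the idea of an exact count distribution concrete, the following is a minimal sketch and not the CodingMotif implementation. It assumes a toy background model in which codons are drawn independently with uniform synonymous usage (no dinucleotide bias), and it uses a plain suffix-state dynamic program rather than the sparse convolution scheme described above, so it does not achieve the stated running time. The names `TOY_CODONS`, `motif_count_distribution`, and `upper_tail_p` are hypothetical.

```python
from collections import defaultdict

# Assumption: a toy codon table with uniform synonymous usage.
# A realistic background model would use measured codon frequencies
# and dinucleotide biases, which this sketch ignores.
TOY_CODONS = {
    "M": {"ATG": 1.0},
    "A": {"GCT": 0.25, "GCC": 0.25, "GCA": 0.25, "GCG": 0.25},
    "K": {"AAA": 0.5, "AAG": 0.5},
}


def motif_count_distribution(protein, motif, codon_probs=TOY_CODONS):
    """Exact distribution of motif occurrences over all synonymous encodings.

    State: the last k-1 nucleotides emitted so far, mapped to a
    {count: probability} table. Overlapping occurrences are counted.
    """
    k = len(motif)
    states = {"": {0: 1.0}}
    for aa in protein:
        new_states = defaultdict(lambda: defaultdict(float))
        for suffix, dist in states.items():
            for codon, p in codon_probs[aa].items():
                seq = suffix + codon
                # The suffix is shorter than the motif, so every window
                # found here ends inside the newly added codon and is
                # counted exactly once across the whole sequence.
                hits = sum(
                    1 for i in range(len(seq) - k + 1) if seq[i:i + k] == motif
                )
                new_suffix = seq[-(k - 1):] if k > 1 else ""
                for count, q in dist.items():
                    new_states[new_suffix][count + hits] += q * p
        states = new_states
    # Marginalize over the final suffix state.
    total = defaultdict(float)
    for dist in states.values():
        for count, q in dist.items():
            total[count] += q
    return dict(total)


def upper_tail_p(dist, observed):
    """Exact p-value for overrepresentation: P(count >= observed)."""
    return sum(q for count, q in dist.items() if count >= observed)


if __name__ == "__main__":
    dist = motif_count_distribution("AA", "CG")
    print(sorted(dist.items()))
    print(upper_tail_p(dist, 2))
```

Because the distribution is enumerated exactly under the assumed model, the tail probability requires neither sampling nor a parametric fit; the cost of this naive version grows with the number of distinct suffix states and attainable counts.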
CodingMotif provides a theoretically and empirically demonstrated advance for the detection of motifs overrepresented in coding sequences. We expect CodingMotif to be useful for identifying motifs in functional genomic datasets such as DNA-protein binding, RNA-protein binding, or microRNA-RNA binding within coding regions. A software implementation is available at http://bioinformatics.bc.edu/chuanglab/codingmotif.tar.