CodingMotif: exact determination of overrepresented nucleotide motifs in coding sequences.

Affiliation

Department of Biology, Boston College, Chestnut Hill, MA 02467, USA.

Publication information

BMC Bioinformatics. 2012 Feb 14;13:32. doi: 10.1186/1471-2105-13-32.

DOI:10.1186/1471-2105-13-32

PMID:22333114

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3298695/

Abstract

BACKGROUND

It has been increasingly appreciated that coding sequences harbor regulatory sequence motifs in addition to encoding protein. These sequence motifs are expected to be overrepresented in nucleotide sequences bound by a common protein or small RNA. However, detecting overrepresented motifs has been difficult because of interference by constraints at the protein level. Sampling-based approaches to this problem, which rely on codon shuffling, can explore only an infinitesimal fraction of the sequence space and depend on parametric approximations.
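To give a sense of scale for the sequence space mentioned above, the following minimal sketch counts how many distinct coding sequences encode a given peptide under the standard genetic code. The function name `count_synonymous` is hypothetical and is not part of the CodingMotif software; the count is simply the product of the codon degeneracies of the residues.

```python
from collections import Counter
from math import prod

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Number of synonymous codons per amino acid, e.g. 6 for L, 1 for M.
DEGENERACY = Counter(AMINO)


def count_synonymous(peptide):
    """Number of distinct coding sequences that encode `peptide`."""
    return prod(DEGENERACY[aa] for aa in peptide)


if __name__ == "__main__":
    for pep in ("MW", "MLS", "LSR" * 10):
        print(pep, count_synonymous(pep))
```

Because the count grows multiplicatively with peptide length, any fixed number of codon-shuffled samples covers a vanishing share of the possibilities for a protein of realistic length.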

RESULTS

We present a novel O(N (log N)^2)-time algorithm, CodingMotif, to identify nucleotide-level motifs of unusual copy number in protein-coding regions. Using a new dynamic programming algorithm, we are able to exhaustively calculate the distribution of the number of occurrences of a motif over all possible coding sequences that encode the same amino acid sequence, given a background model for codon usage and dinucleotide biases. Our method takes advantage of the sparseness of loci where a given motif can occur, greatly speeding up the required convolution calculations. With the full distribution in hand, one can compute an exact, non-parametric p-value for whether a given motif is over- or underrepresented. We demonstrate that our method identifies known functional motifs more accurately than sampling-based and parametric approaches in coding datasets of various sizes, including ChIP-seq data for the transcription factors NRSF and GABP.
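The idea of an exact occurrence distribution can be illustrated with a small dynamic program. This is a simplified sketch, not the CodingMotif implementation: it assumes independent codon choice (uniform unless a hypothetical `usage` dictionary of codon weights is supplied), ignores dinucleotide biases, and does not exploit sparseness, so it does not achieve the stated running time. All function names are illustrative assumptions.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_codons(aa, usage=None):
    """[(codon, probability)] for one amino acid; uniform if usage is None."""
    codons = [c for c, a in CODON_TO_AA.items() if a == aa]
    weights = [1.0 if usage is None else usage[c] for c in codons]
    total = sum(weights)
    return [(c, w / total) for c, w in zip(codons, weights)]


def count_overlapping(text, motif):
    """Count occurrences of motif in text, allowing overlaps."""
    return sum(text.startswith(motif, i) for i in range(len(text) - len(motif) + 1))


def occurrence_distribution(peptide, motif, usage=None):
    """Exact P(motif count = k) over all synonymous encodings of peptide."""
    keep = len(motif) - 1  # nucleotides that can still start a future match
    states = {("", 0): 1.0}  # (trailing nucleotides, count so far) -> probability
    for aa in peptide:
        options = synonymous_codons(aa, usage)
        nxt = defaultdict(float)
        for (suffix, count), prob in states.items():
            for codon, p in options:
                window = suffix + codon
                # The suffix is shorter than the motif, so every match
                # found in the window ends inside the newly added codon.
                hits = count_overlapping(window, motif)
                tail = window[max(0, len(window) - keep):]
                nxt[(tail, count + hits)] += prob * p
        states = nxt
    dist = defaultdict(float)
    for (_, count), prob in states.items():
        dist[count] += prob
    return dict(dist)


def convolve(d1, d2):
    """Distribution of the summed count over two independent sequences."""
    out = defaultdict(float)
    for k1, p1 in d1.items():
        for k2, p2 in d2.items():
            out[k1 + k2] += p1 * p2
    return dict(out)


def upper_tail_p(dist, observed):
    """Exact p-value for overrepresentation: P(count >= observed)."""
    return sum(p for k, p in dist.items() if k >= observed)


if __name__ == "__main__":
    d = occurrence_distribution("KK", "AAA")
    print(sorted(d.items()), upper_tail_p(d, 3))
```

For a dataset of several coding regions, the per-sequence distributions would be combined with `convolve` before computing the tail probability; the lower tail (`k <= observed`) gives the corresponding test for underrepresentation.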

CONCLUSIONS

CodingMotif provides a theoretically and empirically-demonstrated advance for the detection of motifs overrepresented in coding sequences. We expect CodingMotif to be useful for identifying motifs in functional genomic datasets such as DNA-protein binding, RNA-protein binding, or microRNA-RNA binding within coding regions. A software implementation is available at http://bioinformatics.bc.edu/chuanglab/codingmotif.tar.
