Statistical analysis of over-represented words in human promoter sequences.

Author information

Mariño-Ramírez Leonardo, Spouge John L, Kanga Gavin C, Landsman David

Affiliations

Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, MSC 6075 Bethesda, MD 20894-6075, USA.

Publication information

Nucleic Acids Res. 2004 Feb 12;32(3):949-58. doi: 10.1093/nar/gkh246. Print 2004.

DOI:10.1093/nar/gkh246

PMID:14963262

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC373387/

Abstract

The identification and characterization of regulatory sequence elements in the proximal promoter region of a gene can be facilitated by knowing the precise location of the transcriptional start site (TSS). Using known TSSs from over 5700 different human full-length cDNAs, this study extracted a set of 4737 distinct putative promoter regions (PPRs) from the human genome. Each PPR consisted of nucleotides from -2000 to +1000 bp, relative to the corresponding TSS. Since many regulatory regions contain short, highly conserved strings of less than 10 nucleotides, we counted eight-letter words within the PPRs, using z-scores and other related statistics to evaluate their over- and under-representation. Several over-represented eight-letter words have known biological functions described in the eukaryotic transcription factor database TRANSFAC; however, many did not. Besides calculating a P-value with the standard normal approximation associated with z-scores, we used two extra statistical controls to evaluate the significance of over-represented words. These controls have important implications for evaluating over- and under-represented words with z-scores.
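
The word-counting and z-score idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' method: the abstract does not state the background model or the exact statistic, so the sketch assumes independent nucleotides with frequencies estimated from the input, uses a binomial variance that ignores word self-overlap, and takes the upper-tail P-value from the standard normal approximation. All function names (count_words, base_freqs, z_score, upper_tail_p) are hypothetical.

```python
import math
from collections import Counter


def count_words(seqs, k=8):
    """Count every overlapping k-letter word made only of A, C, G, T."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            w = s[i:i + k]
            if set(w) <= set("ACGT"):  # skip windows containing N or other symbols
                counts[w] += 1
    return counts


def base_freqs(seqs):
    """Estimate single-nucleotide frequencies from the sequences themselves."""
    c = Counter(ch for s in seqs for ch in s.upper() if ch in "ACGT")
    total = sum(c.values())
    return {b: c[b] / total for b in "ACGT"}


def z_score(word, observed, n_positions, freqs):
    """z = (observed - expected) / sd under an ASSUMED independent-base model.

    The variance is the plain binomial one; it ignores the correlation
    between overlapping occurrences of the same word.
    """
    p = math.prod(freqs[b] for b in word)
    expected = n_positions * p
    var = n_positions * p * (1.0 - p)
    if var == 0.0:
        return 0.0  # degenerate background; no meaningful score
    return (observed - expected) / math.sqrt(var)


def upper_tail_p(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    # Toy sequences standing in for putative promoter regions (made-up data).
    pprs = ["ACGTACGTACGT", "TTTTTTTTTT"]
    counts = count_words(pprs)
    freqs = base_freqs(pprs)
    n = sum(counts.values())
    for word, obs in counts.most_common(3):
        z = z_score(word, obs, n, freqs)
        print(word, obs, round(z, 2), upper_tail_p(z))
```

With 4^8 = 65,536 possible eight-letter words tested at once, a raw normal-approximation P-value can overstate significance, which is one reason extra statistical controls of the kind the abstract mentions are useful.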

Similar articles

The biological function of some human transcription factor binding motifs varies with position relative to the transcription start site.

Nucleic Acids Res. 2008 May;36(8):2777-86. doi: 10.1093/nar/gkn137. Epub 2008 Mar 26.

Large-scale collection and characterization of promoters of human and mouse genes.

In Silico Biol. 2004;4(4):429-44.

Conserved short sequences in promoter regions of human genome.

J Biomol Struct Dyn. 2010 Apr;27(5):599-610. doi: 10.1080/07391102.2010.10508574.

Genome-wide identification of transcription start sites, promoters and transcription factor binding sites in E. coli.

PLoS One. 2009 Oct 19;4(10):e7526. doi: 10.1371/journal.pone.0007526.

Genomic structure and functional characterization of the promoter region of human IkappaB kinase-related kinase IKKi/IKKε gene.

Gene. 2005 Jun 20;353(1):118-33. doi: 10.1016/j.gene.2005.04.013.

GPMiner: an integrated system for mining combinatorial cis-regulatory elements in mammalian gene group.

BMC Genomics. 2012;13 Suppl 1(Suppl 1):S3. doi: 10.1186/1471-2164-13-S1-S3. Epub 2012 Jan 17.

Wide-scale analysis of human functional transcription factor binding reveals a strong bias towards the transcription start site.

PLoS One. 2007 Aug 29;2(8):e807. doi: 10.1371/journal.pone.0000807.

Motif composition, conservation and condition-specificity of single and alternative transcription start sites in the Drosophila genome.

Genome Biol. 2009;10(7):R73. doi: 10.1186/gb-2009-10-7-r73. Epub 2009 Jul 9.

Genome-wide prediction of transcriptional regulatory elements of human promoters using gene expression and promoter analysis data.

BMC Bioinformatics. 2006 Jul 4;7:330. doi: 10.1186/1471-2105-7-330.

Cited by

Formation and Recognition of UV-Induced DNA Damage within Genome Complexity.

Int J Mol Sci. 2020 Sep 12;21(18):6689. doi: 10.3390/ijms21186689.

Structural basis of HapE-linked antifungal triazole resistance in .

Life Sci Alliance. 2020 May 28;3(7). doi: 10.26508/lsa.202000729. Print 2020 Jul.

What Do Neighbors Tell About You: The Local Context of Cis-Regulatory Modules Complicates Prediction of Regulatory Variants.

Front Genet. 2019 Oct 31;10:1078. doi: 10.3389/fgene.2019.01078. eCollection 2019.

SNP and indel frequencies at transcription start sites and at canonical and alternative translation initiation sites in the human genome.

PLoS One. 2019 Apr 12;14(4):e0214816. doi: 10.1371/journal.pone.0214816. eCollection 2019.

Long Non-coding RNA: Characterizing the Locus Features by the Approaches.

Genomics Inform. 2017 Dec;15(4):170-177. doi: 10.5808/GI.2017.15.4.170. Epub 2017 Dec 29.

Role of Y Box Protein-1 in cancer: As potential biomarker and novel therapeutic target.

J Cancer. 2017 Jul 3;8(10):1900-1907. doi: 10.7150/jca.17689. eCollection 2017.

Most of the tight positional conservation of transcription factor binding sites near the transcription start site reflects their co-localization within regulatory modules.

BMC Bioinformatics. 2016 Nov 21;17(1):479. doi: 10.1186/s12859-016-1354-5.

Constitutive overexpression of the TaNF-YB4 gene in transgenic wheat significantly improves grain yield.

J Exp Bot. 2015 Nov;66(21):6635-50. doi: 10.1093/jxb/erv370. Epub 2015 Jul 27.

Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models.

Algorithms Mol Biol. 2014 Dec 16;9(1):25. doi: 10.1186/s13015-014-0025-1. eCollection 2014.

Determination and inference of eukaryotic transcription factor sequence specificity.

Cell. 2014 Sep 11;158(6):1431-1443. doi: 10.1016/j.cell.2014.08.009.
