Mariño-Ramírez Leonardo, Spouge John L, Kanga Gavin C, Landsman David
Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, MSC 6075 Bethesda, MD 20894-6075, USA.
Nucleic Acids Res. 2004 Feb 12;32(3):949-58. doi: 10.1093/nar/gkh246. Print 2004.
The identification and characterization of regulatory sequence elements in the proximal promoter region of a gene can be facilitated by knowing the precise location of the transcriptional start site (TSS). Using known TSSs from over 5700 different human full-length cDNAs, this study extracted a set of 4737 distinct putative promoter regions (PPRs) from the human genome. Each PPR consisted of nucleotides from -2000 to +1000 bp, relative to the corresponding TSS. Since many regulatory regions contain short, highly conserved strings of fewer than 10 nucleotides, we counted eight-letter words within the PPRs, using z-scores and other related statistics to evaluate their over- and under-representation. Several over-represented eight-letter words have known biological functions described in the eukaryotic transcription factor database TRANSFAC; however, many do not. Besides calculating a P-value with the standard normal approximation associated with z-scores, we used two extra statistical controls to evaluate the significance of over-represented words. These controls have important implications for evaluating over- and under-represented words with z-scores.
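The word-counting and z-score steps described above can be illustrated with a minimal Python sketch. The abstract does not state which background model was used to compute expected counts, so this sketch assumes the simplest one: independent bases with frequencies estimated from the input sequences, and a binomial variance that ignores the correlation between overlapping words. The function names (`count_words`, `base_frequencies`, `z_score`, `upper_tail_p`) are hypothetical and are not taken from the paper, and the two extra statistical controls mentioned above are not reproduced here.

```python
import math
from collections import Counter

def count_words(seqs, k=8):
    """Count every overlapping k-letter word made only of A, C, G, T."""
    counts = Counter()
    positions = 0  # number of valid word start positions
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):  # skip windows containing N, etc.
                counts[word] += 1
                positions += 1
    return counts, positions

def base_frequencies(seqs):
    """Estimate single-base frequencies from the sequences themselves."""
    c = Counter(b for s in seqs for b in s.upper() if b in "ACGT")
    total = sum(c.values())
    return {b: c[b] / total for b in "ACGT"}

def z_score(word, observed, positions, freqs):
    """z = (observed - expected) / sd under the assumed independent-base model."""
    p = math.prod(freqs[b] for b in word)  # probability of the word at one position
    expected = positions * p
    variance = positions * p * (1.0 - p)   # binomial approximation (assumption)
    if variance == 0.0:
        return 0.0                         # degenerate case: no variation possible
    return (observed - expected) / math.sqrt(variance)

def upper_tail_p(z):
    """One-sided P-value from the standard normal approximation."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

A typical use would call `count_words` on the extracted regions, compute `z_score` for each observed word, and rank words by `upper_tail_p`. Because thousands of words are tested at once and overlapping occurrences are not independent, P-values from such a sketch should be read as rough indicators rather than calibrated significance levels.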