Kel-Margoulis Olga V, Tchekmenev Dmitri, Kel Alexander E, Goessling Ellen, Hornischer Klaus, Lewicki-Potapov Birgit, Wingender Edgar
BIOBASE GmbH, Halchtersche Str 33, D-38304 Wolfenbuettel, Germany.
In Silico Biol. 2003;3(1-2):145-71. Epub 2003 Jun 27.
Known transcription regulatory signals, which generally act as transcription factor (TF) binding sites, differ significantly in their base composition. Therefore, their occurrence in a genome depends largely on the local base composition. As a first step toward a whole-genome analysis of the occurrence of potential TF binding sites in the human genome, we systematically analyzed the GC content of distinct functional regions (e.g., upstream and downstream gene regions, exons, long and short introns, repetitive elements) and correlated it with the frequencies of potential binding sites of a representative set of TFs in these regions. For these analyses, we used the pattern collection of the TRANSFAC database on transcriptional regulation, information about functionally relevant combinations of these patterns from the TRANSCompel database, and our new resource, TRANSGenome™, which provides an overall annotation of the human genome with emphasis on its regulatory characteristics. We show that the occurrence of sequence patterns with regulatory potential may be supported by, but cannot be fully explained by, the GC content of a whole chromosome, the GC content of its putative promoter regions, or the information content of the patterns. Several patterns (HNF-3, NFAT, and GC box) show a clear overrepresentation in all promoter groups as well as in all chromosomes. Other patterns, such as E2F and CRE-BP1, are underrepresented in all promoter groups as well as in all chromosomes in comparison with random sequences. At the same time, both of these patterns are overrepresented in promoters in comparison with repetitive elements. We define several structural characteristics of the proximal promoters that differentiate them from other functional genomic regions. Two well-known promoter elements, the GC box and the TATA box, are statistically enriched in promoters in comparison with random sequences, repetitive elements, and exons. Altogether, our findings provide insights into the macroheterogeneity among individual chromosomes and the microheterogeneity among different functional regions of individual chromosomes, contribute to a further understanding of the structural organization of gene regulatory regions, and give first hints about the development of regulatory features during evolution.
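The comparison of observed pattern frequencies with random sequences of matching base composition can be illustrated with a minimal sketch. It is not the authors' method: the abstract refers to TRANSFAC patterns, whereas this example assumes exact-match consensus strings (the GC box core `GGGCGG` is used purely for illustration), a single strand, and a simple independent-base background model in which G and C each occur with probability GC/2 and A and T each with probability (1 - GC)/2. All function names are hypothetical.

```python
import math


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else 0.0


def site_probability(pattern, gc):
    """Probability that one position matches an exact A/C/G/T pattern.

    Assumption: bases are independent; G and C each have probability gc/2,
    A and T each have probability (1 - gc)/2.
    """
    p = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    return math.prod(p[base] for base in pattern.upper())


def count_occurrences(seq, pattern):
    """Count exact matches of pattern in seq, overlaps included, one strand."""
    seq, pattern = seq.upper(), pattern.upper()
    k = len(pattern)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == pattern)


def enrichment(seq, pattern):
    """Return (observed, expected, observed/expected) for pattern in seq.

    The expected count uses the GC content of seq itself as the background,
    so a ratio above 1 means more matches than composition alone predicts.
    """
    positions = max(len(seq) - len(pattern) + 1, 0)
    expected = positions * site_probability(pattern, gc_content(seq))
    observed = count_occurrences(seq, pattern)
    ratio = observed / expected if expected > 0 else float("nan")
    return observed, expected, ratio


if __name__ == "__main__":
    toy_region = "GGGCGGATAT"  # toy sequence, not real genomic data
    print(enrichment(toy_region, "GGGCGG"))
```

Because the expected count is derived from the GC content of the region under study, the same pattern can appear enriched in an AT-rich region and depleted in a GC-rich one, which is why a composition-aware background is needed before frequencies are compared across functional regions.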