Department of Biology and Biotech Research and Innovation Centre, The Bioinformatics Centre, Copenhagen University, Copenhagen, Denmark.
PLoS One. 2011;6(8):e23409. doi: 10.1371/journal.pone.0023409. Epub 2011 Aug 24.
Genome-wide, high-throughput methods for transcription start site (TSS) detection have shown that most promoters have an array of neighboring TSSs, some of which are used more than others, forming a distribution of initiation propensities. TSS distributions (TSSDs) vary widely between promoters, and earlier studies have shown that TSSDs have biological implications for both regulation and function. However, no systematic study has explored how many types of TSSDs, and by extension core promoters, exist, or which biological features distinguish them. In this study, we developed a new non-parametric dissimilarity measure and clustering approach to explore the similarities and stabilities of clusters of TSSDs. Previous studies have used arbitrary thresholds to arrive at two general classes: broad and sharp. We demonstrate that, in addition to the previous broad/sharp dichotomy, an additional category of promoters exists. Unlike typical TATA-driven sharp TSSDs, where the TSS position can vary by a few nucleotides, in this category virtually all TSSs originate from the same genomic position. These promoters lack the epigenetic signatures of typical mRNA promoters, and a substantial subset of them map upstream of ribosomal protein pseudogenes. We present evidence that these are likely mapping errors, caused by the high similarity of ribosomal gene promoters in combination with the known G addition bias in CAGE libraries, and that they have confounded earlier analyses. Thus, previous two-class separations of promoters based on TSS distributions are justified, but the ultra-sharp TSS distributions will confound downstream analyses if not removed.
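To make the idea of comparing TSSDs and removing ultra-sharp ones concrete, the following is a minimal sketch and not the method used in the study. It assumes each TSSD is given as a list of tag counts per nucleotide position over a fixed-width window. The function names, the 0.95 cutoff in `is_ultra_sharp`, and the cumulative-distribution distance in `cdf_distance` are all illustrative assumptions; the abstract does not specify the actual dissimilarity measure or any threshold.

```python
from typing import Sequence


def normalize(counts: Sequence[float]) -> list[float]:
    """Turn per-position tag counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("a TSSD needs at least one tag")
    return [c / total for c in counts]


def dominant_fraction(counts: Sequence[float]) -> float:
    """Fraction of all tags that fall on the single most used position."""
    return max(normalize(counts))


def is_ultra_sharp(counts: Sequence[float], threshold: float = 0.95) -> bool:
    """Flag TSSDs where nearly all tags come from one position.

    The threshold is a hypothetical value chosen for illustration only.
    """
    return dominant_fraction(counts) >= threshold


def cdf_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Stand-in non-parametric dissimilarity between two TSSDs.

    Sums the absolute differences between the two cumulative
    distributions. This is an assumed substitute, not the measure
    developed in the study.
    """
    if len(a) != len(b):
        raise ValueError("TSSDs must cover windows of equal width")
    cum_a = cum_b = 0.0
    distance = 0.0
    for pa, pb in zip(normalize(a), normalize(b)):
        cum_a += pa
        cum_b += pb
        distance += abs(cum_a - cum_b)
    return distance


# Hypothetical tag counts over a 5-nucleotide window.
tssds = {
    "broad_like": [5, 20, 50, 20, 5],
    "sharp_like": [0, 10, 85, 5, 0],
    "ultra_sharp_like": [0, 0, 100, 1, 0],
}

# Drop ultra-sharp TSSDs before any downstream comparison.
kept = {name: c for name, c in tssds.items() if not is_ultra_sharp(c)}
```

A pairwise matrix of `cdf_distance` values over the retained TSSDs could then be passed to any standard clustering routine; which routine suits this data is left open here.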