Shimokawa Kazuro, Okamura-Oho Yuko, Kurita Takio, Frith Martin C, Kawai Jun, Carninci Piero, Hayashizaki Yoshihide
Genome Exploration Research Group, RIKEN Genomic Sciences Center, RIKEN Yokohama Institute, Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, Japan.
BMC Bioinformatics. 2007 May 21;8:161. doi: 10.1186/1471-2105-8-161.
Recent analyses suggest that many genes possess multiple transcription start sites (TSSs) that are used differentially across tissues and cell lines. We have identified a large number of TSSs mapped onto the mouse genome using the cap analysis of gene expression (CAGE) method. The standard hierarchical clustering algorithm, which yields easily understandable graphical tree images, has difficulty processing this volume of TSS data, so a better method for calculating and displaying the results is needed.
To profit from the strengths of both methods, we combine hierarchical and non-hierarchical clustering to cluster TSS expression profiles based on a large amount of CAGE data. We processed genome-wide expression data comprising 159,075 TSSs derived from 127 RNA samples from various mouse organs, and categorized them into 70-100 clusters. The clusters exhibited intriguing biological features: a cluster supergroup with a ubiquitous expression profile, tissue-specific patterns, a distinct distribution of non-coding RNA, and functional TSS groups.
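The general idea of combining the two kinds of clustering can be illustrated with a minimal sketch. It is not the authors' implementation. This abstract does not name the specific algorithms, so the sketch assumes k-means as the non-hierarchical step and naive average-linkage agglomeration as the hierarchical step. The function names (`kmeans`, `agglomerate`, `two_stage`), the parameters, and the synthetic data are all hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means (assumed non-hierarchical step)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows of X chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    # Final assignment against the final centroids.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.argmin(axis=1)

def agglomerate(points, n_groups):
    """Naive average-linkage agglomeration (assumed hierarchical step)."""
    groups = [[i] for i in range(len(points))]
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    while len(groups) > n_groups:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                # Average pairwise distance between the two groups.
                d = dist[np.ix_(groups[a], groups[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        groups[a] += groups.pop(b)  # merge the closest pair (b > a)
    out = np.empty(len(points), dtype=int)
    for g, members in enumerate(groups):
        out[members] = g
    return out

def two_stage(X, k_fine, n_groups, seed=0):
    """Cluster rows finely with k-means, then group the centroids."""
    centroids, fine = kmeans(X, k_fine, seed=seed)
    coarse_of_centroid = agglomerate(centroids, n_groups)
    return coarse_of_centroid[fine]  # coarse label for every row

# Synthetic stand-in for expression profiles: 300 rows over 5 samples.
rng = np.random.default_rng(1)
patterns = np.array([[5, 5, 5, 5, 5], [9, 0, 0, 0, 0], [0, 0, 0, 0, 9]], float)
X = np.vstack([p + 0.1 * rng.normal(size=(100, 5)) for p in patterns])
labels = two_stage(X, k_fine=12, n_groups=3)
```

In this sketch the hierarchical step builds its pairwise distance matrix over the `k_fine` centroids rather than over every row. That is one way a combined scheme can keep the tree-building step small when the number of profiles is large.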
Our approach succeeded in greatly reducing the calculation cost, and is an appropriate solution for analyzing large-scale TSS usage data.