Rydbeck Halfdan, Sandve Geir Kjetil, Ferkingstad Egil, Simovski Boris, Rye Morten, Hovig Eivind
Department of Informatics, University of Oslo, Oslo, Norway; Department of Tumour Biology, The Norwegian Radium Hospital, Oslo University Hospital, Oslo, Norway.
Department of Informatics, University of Oslo, Oslo, Norway.
PLoS One. 2015 Apr 16;10(4):e0123261. doi: 10.1371/journal.pone.0123261. eCollection 2015.
Clustering is a popular technique for exploratory data analysis, as it can reveal subgroupings and similarities between data in an unsupervised manner. While clustering is routinely applied to gene expression data, there is a lack of appropriate general methodology for clustering sequence-level genomic and epigenomic data, e.g. ChIP-based data. Here we introduce a general methodology for clustering data sets of coordinates relative to a genome assembly, i.e. genomic tracks. By defining appropriate feature extraction approaches and similarity measures, we allow biologically meaningful clustering of genomic tracks to be performed using standard clustering algorithms. An implementation of the methodology is provided through a tool, ClusTrack, which allows fine-tuned clustering analyses to be specified through a web-based interface. We apply our methods to the clustering of occupancy of the H3K4me1 histone modification in samples from a range of different cell types. The majority of samples form meaningful subclusters, confirming that the definitions of features and similarity capture biological, rather than technical, variation between the genomic tracks. Input data and results are available, and can be reproduced, through a Galaxy Pages document at http://hyperbrowser.uio.no/hb/u/hb-superuser/p/clustrack. The clustering functionality is available as a Galaxy tool, under the menu option "Specialized analyzis of tracks", and the submenu option "Cluster tracks based on genome level similarity", at the Genomic HyperBrowser server: http://hyperbrowser.uio.no/hb/.
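The general idea of extracting features from genomic tracks, defining a similarity measure, and handing the result to a standard clustering algorithm can be illustrated with a minimal Python sketch. This is not ClusTrack's implementation: the choice of fixed-size bin occupancy as the feature, Jaccard distance as the (dis)similarity measure, and naive average-linkage clustering are assumptions made for illustration only, as are all function names, the bin size, and the toy sample tracks.

```python
from itertools import combinations


def bin_features(track, bin_size):
    """Return the set of (chrom, bin_index) bins touched by any interval.

    A track is a list of half-open (chrom, start, end) tuples.
    Binning is an illustrative feature choice, not taken from the paper.
    """
    bins = set()
    for chrom, start, end in track:
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            bins.add((chrom, b))
    return bins


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; two empty feature sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def average_linkage(names, dist, n_clusters):
    """Naive agglomerative clustering: repeatedly merge the closest pair."""
    clusters = [[n] for n in names]

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


def cluster_tracks(tracks, bin_size=1000, n_clusters=2):
    """Feature extraction -> pairwise distances -> clustering."""
    feats = {name: bin_features(t, bin_size) for name, t in tracks.items()}
    dist = {
        frozenset((a, b)): jaccard_distance(feats[a], feats[b])
        for a, b in combinations(feats, 2)
    }
    return average_linkage(list(feats), dist, n_clusters)


# Hypothetical toy tracks: two samples occupy chr1, two occupy chr2.
tracks = {
    "sample_A": [("chr1", 0, 1500), ("chr1", 5000, 5500)],
    "sample_B": [("chr1", 200, 1800), ("chr1", 5100, 5900)],
    "sample_C": [("chr2", 0, 900), ("chr2", 3000, 3500)],
    "sample_D": [("chr2", 100, 800), ("chr2", 3200, 3900)],
}
result = cluster_tracks(tracks)
print(result)
```

In this sketch the feature and distance functions are deliberately separate from the clustering step, so either could be swapped for a different definition without touching the rest, which mirrors the abstract's point that the choice of features and similarity measure is what makes the clustering biologically meaningful.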