ClusTrack: feature extraction and similarity measures for clustering of genome-wide data sets.

Authors

Rydbeck Halfdan, Sandve Geir Kjetil, Ferkingstad Egil, Simovski Boris, Rye Morten, Hovig Eivind

Affiliations

Department of Informatics, University of Oslo, Oslo, Norway; Department of Tumour Biology, The Norwegian Radium Hospital, Oslo University Hospital, Oslo, Norway.

Department of Informatics, University of Oslo, Oslo, Norway.

Publication information

PLoS One. 2015 Apr 16;10(4):e0123261. doi: 10.1371/journal.pone.0123261. eCollection 2015.

DOI:10.1371/journal.pone.0123261

PMID:25879845

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4400084/

Abstract

Clustering is a popular technique for explorative analysis of data, as it can reveal subgroupings and similarities between data in an unsupervised manner. While clustering is routinely applied to gene expression data, there is a lack of appropriate general methodology for clustering of sequence-level genomic and epigenomic data, e.g. ChIP-based data. We here introduce a general methodology for clustering data sets of coordinates relative to a genome assembly, i.e. genomic tracks. By defining appropriate feature extraction approaches and similarity measures, we allow biologically meaningful clustering to be performed for genomic tracks using standard clustering algorithms. An implementation of the methodology is provided through a tool, ClusTrack, which allows fine-tuned clustering analyses to be specified through a web-based interface. We apply our methods to the clustering of occupancy of the H3K4me1 histone modification in samples from a range of different cell types. The majority of samples form meaningful subclusters, confirming that the definitions of features and similarity capture biological, rather than technical, variation between the genomic tracks. Input data and results are available, and can be reproduced, through a Galaxy Pages document at http://hyperbrowser.uio.no/hb/u/hb-superuser/p/clustrack. The clustering functionality is available as a Galaxy tool, under the menu option "Specialized analyzis of tracks", and the submenu option "Cluster tracks based on genome level similarity", at the Genomic HyperBrowser server: http://hyperbrowser.uio.no/hb/.
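The general idea in the abstract (extract features from each genomic track, define a similarity measure on those features, then hand the result to a standard clustering algorithm) can be illustrated with a minimal sketch. This is not ClusTrack's implementation: the function names, the choice of binned coverage as the feature, the weighted Jaccard similarity, and the toy tracks are all assumptions made for illustration only.

```python
from itertools import combinations


def bin_coverage(track, genome_length, bin_size):
    """Feature extraction (illustrative): fraction of each fixed-size bin
    covered by the track. `track` is a list of non-overlapping, half-open
    (start, end) intervals on a single hypothetical sequence."""
    n_bins = -(-genome_length // bin_size)  # ceiling division
    covered = [0] * n_bins
    for start, end in track:
        pos = start
        while pos < end:
            b = pos // bin_size
            bin_end = min((b + 1) * bin_size, end)
            covered[b] += bin_end - pos
            pos = bin_end
    return [c / bin_size for c in covered]


def similarity(u, v):
    """Weighted Jaccard similarity between two feature vectors (1.0 = identical)."""
    denom = sum(max(a, b) for a, b in zip(u, v))
    return sum(min(a, b) for a, b in zip(u, v)) / denom if denom else 1.0


def cluster(features, k):
    """Average-linkage agglomerative clustering on distance = 1 - similarity,
    merging the closest pair of clusters until k clusters remain."""
    names = list(features)
    dist = {frozenset(p): 1.0 - similarity(features[p[0]], features[p[1]])
            for p in combinations(names, 2)}
    clusters = [[n] for n in names]

    def linkage(a, b):
        return sum(dist[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)  # i < j, so index i stays valid
    return [sorted(c) for c in clusters]


# Toy tracks (made-up coordinates): two "replicates" for each of two cell types.
tracks = {
    "cellA_rep1": [(0, 100), (400, 500)],
    "cellA_rep2": [(10, 110), (390, 500)],
    "cellB_rep1": [(200, 300), (700, 800)],
    "cellB_rep2": [(210, 320), (700, 790)],
}
features = {name: bin_coverage(t, genome_length=1000, bin_size=100)
            for name, t in tracks.items()}
result = cluster(features, k=2)
print(result)
```

In this toy setting the two cell types separate because replicate tracks share most of their covered bins. Other feature definitions (for example, coverage restricted to regions of interest) or other similarity measures could be substituted without changing the clustering step.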
