Fast tree aggregation for consensus hierarchical clustering.

Affiliations

Université Paris-Saclay, INRAE, AgroParisTech, GABI, Jouy-en-Josas, 78350, France.

Université Paris-Saclay, AgroParisTech, INRAE, UMR MIA-Paris, Paris, 75005, France.

Publication information

BMC Bioinformatics. 2020 Mar 20;21(1):120. doi: 10.1186/s12859-020-3453-6.

DOI:10.1186/s12859-020-3453-6

PMID:32197576

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7085155/

Abstract

BACKGROUND

In unsupervised learning and clustering, integrating data from different sources and of different types is a difficult question discussed in several research areas. In omics analysis, for instance, dozens of clustering methods have been developed in the past decade. When a single source of data is at play, hierarchical clustering (HC) is extremely popular, as a tree structure is highly interpretable and arguably more informative than a mere partition of the data. However, blindly applying HC to multiple sources of data raises computational and interpretation issues.

RESULTS

We propose mergeTrees, a method that aggregates a set of trees with the same leaves to create a consensus tree. In this consensus tree, a cluster at height h contains the individuals that are in the same cluster at height h in all of the trees. The method is exact and its complexity is proven to be [Formula: see text], n being the number of individuals and q the number of trees to aggregate. Our implementation is highly efficient on simulations, allowing us to process many large trees at a time. We also rely on mergeTrees to perform the cluster analysis of two real omics data sets, introducing a spectral variant as an efficient and robust by-product.
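The consensus rule stated above can be illustrated with a minimal Python sketch. It is not the mergeTrees implementation and makes no attempt to reach the complexity stated in the abstract: it naively cuts every tree at a given height and intersects the resulting partitions. The tree representation (a list of `(height, leaf_a, leaf_b)` merges), the convention that a merge applies when its height is at most h, and the function names `cut_tree` and `consensus_partition` are all assumptions made for illustration.

```python
from collections import defaultdict


def cut_tree(n_leaves, merges, h):
    """Label each leaf with its cluster after applying all merges of height <= h.

    `merges` is a hypothetical representation: (height, leaf_a, leaf_b) means
    the clusters containing leaf_a and leaf_b are joined at that height.
    """
    parent = list(range(n_leaves))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for height, a, b in merges:
        if height <= h:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a
    return [find(leaf) for leaf in range(n_leaves)]


def consensus_partition(n_leaves, trees, h):
    """Group leaves that share a cluster at height h in every tree."""
    labels = [cut_tree(n_leaves, merges, h) for merges in trees]
    groups = defaultdict(list)
    for leaf in range(n_leaves):
        # Two leaves stay together only if their labels agree in all trees.
        key = tuple(tree_labels[leaf] for tree_labels in labels)
        groups[key].append(leaf)
    return sorted(groups.values())


# Two toy trees on the same four leaves (0..3).
tree_a = [(1.0, 0, 1), (2.0, 2, 3), (3.0, 0, 2)]
tree_b = [(1.0, 0, 1), (1.5, 1, 2), (3.0, 0, 3)]

for h in (0.5, 2.0, 3.0):
    print(h, consensus_partition(4, [tree_a, tree_b], h))
```

At height 2.0, the first tree groups leaves {0, 1} and {2, 3} while the second groups {0, 1, 2} and {3}, so only leaves 0 and 1 are together in both trees and the consensus partition is {0, 1}, {2}, {3}. Recomputing the partition from scratch at every height is what a dedicated algorithm avoids; this sketch only shows the definition.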

CONCLUSIONS

Our tree aggregation method can be used in conjunction with hierarchical clustering to perform efficient cluster analysis. This approach was found to be robust to the absence of clustering information in some of the data sets as well as to increased variability within true clusters. The method is implemented in R/C++ and available as an R package named mergeTrees, which makes it easy to integrate into existing or new pipelines in several research areas.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/48fc/7085155/2cc3438dcf65/12859_2020_3453_Fig1_HTML.jpg
