Clustering of samples and variables with mixed-type data.

Author information

Hummel Manuela, Edelmann Dominic, Kopp-Schneider Annette

Affiliation

Division of Biostatistics, German Cancer Research Center, Heidelberg, Germany.

Publication information

PLoS One. 2017 Nov 28;12(11):e0188274. doi: 10.1371/journal.pone.0188274. eCollection 2017.

DOI:10.1371/journal.pone.0188274

PMID:29182671

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5705083/

Abstract

Analysis of data measured on different scales is a relevant challenge. Biomedical studies often focus on high-throughput datasets of, e.g., quantitative measurements. However, the need for integration of other features possibly measured on different scales, e.g. clinical or cytogenetic factors, becomes increasingly important. The analysis results (e.g. a selection of relevant genes) are then visualized, while adding further information, like clinical factors, on top. However, a more integrative approach is desirable, where all available data are analyzed jointly, and where also in the visualization different data sources are combined in a more natural way. Here we specifically target integrative visualization and present a heatmap-style graphic display. To this end, we develop and explore methods for clustering mixed-type data, with special focus on clustering variables. Clustering of variables does not receive as much attention in the literature as does clustering of samples. We extend the variables clustering methodology by two new approaches, one based on the combination of different association measures and the other on distance correlation. With simulation studies we evaluate and compare different clustering strategies. Applying specific methods for mixed-type data proves to be comparable and in many cases beneficial as compared to standard approaches applied to corresponding quantitative or binarized data. Our two novel approaches for mixed-type variables show similar or better performance than the existing methods ClustOfVar and bias-corrected mutual information. Further, in contrast to ClustOfVar, our methods provide dissimilarity matrices, which is an advantage, especially for the purpose of visualization. Real data examples aim to give an impression of various kinds of potential applications for the integrative heatmap and other graphical displays based on dissimilarity matrices. 
We demonstrate that the presented integrative heatmap provides more information than common data displays about the relationship among variables and samples. The described clustering and visualization methods are implemented in our R package CluMix available from https://cran.r-project.org/web/packages/CluMix.
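The distance-correlation idea mentioned in the abstract can be illustrated with a small Python sketch. This is not the CluMix implementation; it is a minimal illustration under stated assumptions: quantitative variables use absolute differences between samples, categorical variables use a 0/1 mismatch distance, the dissimilarity between two variables is taken as one minus their sample distance correlation, and variables are then grouped by average-linkage hierarchical clustering. The function names and the toy data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def pairwise_distances(values, categorical):
    """Return the n x n between-sample distance matrix for one variable."""
    values = np.asarray(values)
    if categorical:
        # Assumption: discrete metric, 0 for equal categories and 1 otherwise.
        return (values[:, None] != values[None, :]).astype(float)
    values = values.astype(float)
    return np.abs(values[:, None] - values[None, :])


def double_center(d):
    """Subtract row and column means and add back the grand mean."""
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_correlation(x, y, x_categorical=False, y_categorical=False):
    """Sample distance correlation (V-statistic form) between two variables."""
    a = double_center(pairwise_distances(x, x_categorical))
    b = double_center(pairwise_distances(y, y_categorical))
    dcov2 = (a * b).mean()
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    if denom == 0:
        return 0.0  # a constant variable carries no association
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def variable_dissimilarity(columns, categorical_flags):
    """Symmetric p x p matrix of 1 - distance correlation between variables."""
    p = len(columns)
    d = np.zeros((p, p))
    for i in range(p):
        for j in range(i + 1, p):
            dcor = distance_correlation(
                columns[i], columns[j], categorical_flags[i], categorical_flags[j]
            )
            d[i, j] = d[j, i] = 1.0 - dcor
    return d


# Toy mixed-type data (hypothetical): x2 and g depend on x1, z and h do not.
rng = np.random.default_rng(0)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
g = np.where(x1 > 0, "high", "low")
z = rng.normal(size=n)
h = rng.choice(["a", "b", "c"], size=n)

columns = [x1, x2, g, z, h]
flags = [False, False, True, False, True]

dissim = variable_dissimilarity(columns, flags)

# Average-linkage clustering of the variables on the dissimilarity matrix.
tree = linkage(squareform(dissim), method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
print(np.round(dissim, 2))
print(labels)
```

Because the result is an ordinary dissimilarity matrix, it can be passed to any distance-based clustering or plotting routine, which is the property the abstract highlights as an advantage for visualization.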

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d517/5705083/ef80da2d5067/pone.0188274.g001.jpg
