Head-to-head comparison of clustering methods for heterogeneous data: a simulation-driven benchmark.

Affiliations

Centre d'Investigations Cliniques Plurithématique 1433, INSERM 1116, CHRU de Nancy, Université de Lorraine, Nancy, France.

F-CRIN INI-CRCT Cardiovascular and Renal Clinical Trialists Network, Nancy, France.

Publication information

Sci Rep. 2021 Feb 18;11(1):4202. doi: 10.1038/s41598-021-83340-8.

DOI:10.1038/s41598-021-83340-8

PMID:33603019

Full-text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC7892576/

Abstract

The choice of the most appropriate unsupervised machine-learning method for "heterogeneous" or "mixed" data, i.e. data with both continuous and categorical variables, can be challenging. Our aim was to examine the performance of various clustering strategies for mixed data using both simulated and real-life data. We conducted a benchmark analysis of "ready-to-use" tools in R comparing 4 model-based (Kamila algorithm, Latent Class Analysis, Latent Class Model [LCM] and Clustering by Mixture Modeling) and 5 distance/dissimilarity-based (Gower distance or Unsupervised Extra Trees dissimilarity followed by hierarchical clustering or Partitioning Around Medoids, K-prototypes) clustering methods. Clustering performance was assessed by the Adjusted Rand Index (ARI) on 1000 generated virtual populations of mixed variables across 7 scenarios with varying population sizes, numbers of clusters, numbers of continuous and categorical variables, proportions of relevant (non-noisy) variables and degrees of variable relevance (low, mild, high). The clustering methods were then applied to data from the EPHESUS randomized clinical trial (a heart failure trial evaluating the effect of eplerenone) to illustrate the differences between clustering techniques. The simulations revealed the dominance of K-prototypes, Kamila and LCM models over all other methods. Overall, methods using dissimilarity matrices in classical algorithms such as Partitioning Around Medoids and Hierarchical Clustering had a lower ARI than model-based methods in all scenarios. When the clustering methods were applied to a real-life clinical dataset, LCM showed promising results with regard to (1) differences in clinical profiles across clusters, (2) prognostic performance (highest C-index) and (3) identification of patient subgroups with substantial treatment benefit.
The present findings suggest key differences in clustering performance between the tested algorithms (limited to tools readily available in R). In most of the tested scenarios, model-based methods (in particular the Kamila and LCM packages) and K-prototypes typically performed best in the setting of heterogeneous data.
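
Two quantities named in the abstract can be illustrated with a short, dependency-free Python sketch: the Gower distance between two mixed records, and the Adjusted Rand Index between a true and a predicted partition. This is not the authors' code (the study used ready-made R packages). The function names, the `ranges` convention (`None` marks a categorical variable) and the toy records are assumptions made for illustration only.

```python
from collections import Counter
from math import comb


def gower_distance(a, b, ranges):
    """Gower distance between two mixed records (illustrative helper).

    ranges[i] is the observed range of continuous variable i,
    or None when variable i is categorical (assumed convention).
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            # Categorical: simple mismatch (0 if equal, 1 otherwise).
            total += 0.0 if x == y else 1.0
        else:
            # Continuous: absolute difference scaled by the variable's range.
            total += abs(x - y) / r if r > 0 else 0.0
    return total / len(ranges)


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand Index computed from the contingency table of two partitions."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    n = len(true_labels)
    if n < 2:
        return 1.0  # no pairs to compare; treat as perfect agreement
    cells = Counter(zip(true_labels, pred_labels))
    rows = Counter(true_labels)
    cols = Counter(pred_labels)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical records: (age in years, diabetes status); age range assumed to be 40.
d = gower_distance((60, "yes"), (70, "no"), ranges=(40, None))

# ARI is invariant to relabelling: swapping cluster names still gives 1.0.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_diff = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(d, ari_same, ari_diff)
```

In a simulation benchmark of this kind, the true labels are known by construction, so an ARI near 1 indicates that a method recovered the generated clusters and an ARI near 0 indicates agreement no better than chance.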