Section Leadership and Management, University of Amsterdam, Amsterdam, The Netherlands.
Department of Methodology and Statistics, Tilburg University, Tilburg, Netherlands.
Behav Res Methods. 2023 Aug;55(5):2157-2174. doi: 10.3758/s13428-022-01795-7. Epub 2022 Sep 9.
The growing availability of high-dimensional data sets offers behavioral scientists an unprecedented opportunity to integrate the information hidden in novel types of data (e.g., genetic data, social media data, and GPS tracks) and thereby obtain a more detailed and comprehensive view of their research questions. In the context of clustering, analyzing a large number of variables could yield a more accurate estimation of underlying subgroups or reveal previously unknown ones. However, a unique challenge is that high-dimensional data sets are likely to contain a large number of irrelevant variables. These irrelevant variables do not contribute to the separation of clusters, and they may mask cluster partitions. The current paper addresses this challenge by introducing a new clustering algorithm, called Cardinality K-means (CKM), and by proposing a novel model selection strategy. CKM performs clustering and variable selection simultaneously and with high stability. In two simulation studies and an empirical demonstration with genetic data, CKM consistently outperformed competing methods in recovering cluster partitions and identifying signaling variables. The novel model selection strategy determines the number of clusters based on the subset of variables that are most likely to be signaling variables. In a simulation study, this strategy estimated the number of clusters more accurately than the conventional strategy, which uses the full set of variables. The proposed CKM algorithm, together with the novel model selection strategy, has been implemented in a freely accessible R package.
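The idea of clustering and selecting variables at the same time can be illustrated with a minimal Python sketch. This is not the authors' CKM implementation or the API of their R package; it is a simplified illustration under stated assumptions. It assumes the cardinality constraint can be read as "at most `n_selected` variables may have non-zero centroid values", and it alternates ordinary K-means updates with a step that keeps the variables with the largest between-cluster sum of squares. The function name `cardinality_kmeans_sketch` and all of its parameters are hypothetical.

```python
import numpy as np


def cardinality_kmeans_sketch(X, n_clusters, n_selected, n_iter=100, seed=0):
    """Alternate K-means updates with a top-`n_selected` variable selection step.

    Illustrative only: an assumed reading of cardinality-constrained K-means,
    not the published CKM algorithm.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Centre the columns so that a zero centroid column means "no cluster signal".
    X = X - X.mean(axis=0)
    n, p = X.shape
    labels = rng.integers(n_clusters, size=n)  # random starting partition

    for _ in range(n_iter):
        # Centroid step: cluster means on all variables.
        centroids = np.zeros((n_clusters, p))
        counts = np.zeros(n_clusters)
        for k in range(n_clusters):
            members = labels == k
            counts[k] = members.sum()
            if counts[k] > 0:
                centroids[k] = X[members].mean(axis=0)
            else:
                # Empty cluster: restart its centroid at a random observation.
                centroids[k] = X[rng.integers(n)]

        # Selection step: between-cluster sum of squares per variable;
        # keep only the n_selected variables with the largest values.
        score = (counts[:, None] * centroids**2).sum(axis=0)
        selected = np.sort(np.argsort(score)[-n_selected:])

        # Assignment step: nearest centroid, using the selected variables only.
        Xs = X[:, selected]
        Cs = centroids[:, selected]
        dist = ((Xs[:, None, :] - Cs[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

    return labels, selected
```

Calling `cardinality_kmeans_sketch(X, n_clusters=3, n_selected=10)` returns one cluster label per row of `X` and the sorted indices of the retained variables. Like ordinary K-means, a single random start can end in a poor local optimum, so in practice one would likely run several starts and keep the best solution.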