Centre for Data Science, School of Mathematical Sciences, Queensland University of Technology, Brisbane, QLD, Australia.
School of Information Science and Engineering, Yunnan University, Kunming, China.
PLoS One. 2023 Aug 21;18(8):e0288000. doi: 10.1371/journal.pone.0288000. eCollection 2023.
Within the ensemble clustering literature, various methods have been developed to combine inferences across multiple sets of unsupervised clustering results. The common approach of reporting results from one 'best' model among several candidate clustering models generally ignores the uncertainty that arises from model selection, and yields inferences that are sensitive to the particular model and parameters chosen. Bayesian model averaging (BMA) is a popular approach for combining results across multiple models that offers attractive benefits in this setting, including a probabilistic interpretation of the combined cluster structure and quantification of model-based uncertainty. In this work we introduce clusterBMA, a method that enables weighted model averaging across results from multiple unsupervised clustering algorithms. We use internal clustering validation criteria to develop an approximation to the posterior model probability, which is used to weight the results from each model. From a combined posterior similarity matrix representing a weighted average of the clustering solutions across models, we apply symmetric simplex matrix factorisation to calculate final probabilistic cluster allocations. In addition to outperforming other ensemble clustering methods on simulated data, clusterBMA offers unique features including probabilistic allocation to averaged clusters, the ability to combine allocation probabilities from 'hard' and 'soft' clustering algorithms, and measurement of model-based uncertainty in averaged cluster allocation. This method is implemented in an accompanying R package of the same name. We use simulated datasets to explore the ability of the proposed technique to identify robust integrated clusters with varying levels of separation between subgroups, and with varying numbers of clusters between models.
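The core averaging step described above can be illustrated with a minimal sketch: each model's hard allocations are converted to a binary co-clustering (similarity) matrix, and these matrices are combined as a weighted average to form the consensus similarity matrix. This is a simplified illustration only, not the clusterBMA package's implementation; in particular, clusterBMA derives the weights from approximate posterior model probabilities based on internal validation criteria, whereas here the weights are supplied directly, and the final symmetric simplex matrix factorisation step is omitted.

```python
import numpy as np

def coclustering_matrix(labels):
    """Binary similarity matrix: entry (i, j) is 1 if points i and j
    share a cluster under this model's allocation, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def weighted_similarity(label_sets, weights):
    """Weighted average of per-model co-clustering matrices.
    Weights are normalised to sum to 1 (a probability simplex)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    mats = [coclustering_matrix(labels) for labels in label_sets]
    return sum(w * m for w, m in zip(weights, mats))

# Hypothetical hard allocations for 4 observations from two models
m1 = [0, 0, 1, 1]
m2 = [0, 0, 0, 1]

# Hypothetical model weights (stand-ins for approximate posterior
# model probabilities in the actual method)
C = weighted_similarity([m1, m2], weights=[0.7, 0.3])
```

Entries of the resulting matrix C lie in [0, 1] and can be read as weighted-average evidence that two observations belong together; for example, observations 1 and 2 are co-clustered only by the second model, so their entry equals the second model's normalised weight.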
When benchmarked against four other ensemble methods previously demonstrated to be highly effective in the literature, clusterBMA matches or exceeds the performance of competing approaches under various conditions of dimensionality and cluster separation. clusterBMA substantially outperformed the other ensemble methods on high-dimensional simulated data with low cluster separation, achieving 1.16 to 7.12 times better performance as measured by the Adjusted Rand Index. We also explore the performance of this approach through a case study that aims to identify probabilistic clusters of individuals based on electroencephalography (EEG) data. In applied settings where individuals are clustered based on health data, the features of probabilistic allocation and measurement of model-based uncertainty in averaged clusters are useful for clinical relevance and statistical communication.