Department of Epidemiology and Biostatistics, Imperial College London, Norfolk Place, London W2 1PG, United Kingdom.
Department of Mathematics, Imperial College London, London SW7 2RH, United Kingdom.
Bioinformatics. 2023 Nov 1;39(11). doi: 10.1093/bioinformatics/btad635.
In consensus clustering, a clustering algorithm is used in combination with a subsampling procedure to detect stable clusters. Previous studies on both simulated and real data suggest that consensus clustering outperforms native algorithms.
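The combination of subsampling and clustering described above can be illustrated with a minimal sketch. It is not the implementation in the sharp package: the choice of k-means as the base algorithm, the farthest-point initialization, the 80% subsample size, and the function names `kmeans` and `consensus_matrix` are all assumptions made for illustration. The sketch computes, for each pair of observations, the proportion of subsamples in which they were assigned to the same cluster among those in which both were drawn.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=50):
    """Plain Lloyd's k-means with a deterministic farthest-point start."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous center if a cluster is empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def consensus_matrix(points, k, n_subsamples=50, fraction=0.8, seed=0):
    """Co-membership proportions over repeated subsamples (assumed settings)."""
    rng = random.Random(seed)
    n = len(points)
    size = max(k, int(fraction * n))
    together = [[0] * n for _ in range(n)]  # times i and j were co-clustered
    sampled = [[0] * n for _ in range(n)]   # times i and j were both drawn
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), size)
        labels = kmeans([points[i] for i in idx], k)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                together[i][j] += labels[a] == labels[b]
    # Pairs never drawn together are given 0.0 by convention in this sketch.
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


# Toy data (made up for illustration): two well-separated groups of five points.
group_a = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6), (0.4, 0.4), (0.2, 0.1)]
group_b = [(10.0, 10.0), (10.5, 10.2), (10.1, 10.6), (10.4, 10.4), (10.2, 10.1)]
points = group_a + group_b
consensus = consensus_matrix(points, k=2)
```

For groups as well separated as these, entries of `consensus` for pairs within a group should be close to 1 and entries for pairs across groups close to 0. Stable clusters are those whose co-membership proportions stay near these extremes across subsamples.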
Here, we extend consensus clustering to allow for attribute weighting in the calculation of pairwise distances using existing regularized approaches. We propose a procedure for calibrating the number of clusters (and the regularization parameter) by maximizing the sharp score, a novel stability score calculated directly from consensus clustering outputs, which makes it extremely computationally competitive. Our simulation study shows better clustering performance of (i) approaches calibrated by maximizing the sharp score compared to existing calibration scores and (ii) weighted compared to unweighted approaches in the presence of features that do not contribute to cluster definition. Application to real gene expression data measured in lung tissue reveals clear clusters corresponding to different lung cancer subtypes.
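To illustrate why attribute weighting can help when some features do not contribute to cluster definition, the following minimal sketch computes a weighted Euclidean distance. The function name `weighted_distance`, the toy vectors, and the hand-set weights are assumptions for illustration only; in the approach described above, the weights would instead be obtained from existing regularized methods.

```python
import math


def weighted_distance(x, y, weights):
    """Weighted Euclidean distance; a weight of 0 drops that attribute."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


# Toy observations (made up): the third attribute is treated as pure noise.
x = (1.0, 2.0, 50.0)
y = (4.0, 6.0, -50.0)

unweighted = weighted_distance(x, y, (1, 1, 1))  # noise attribute dominates
weighted = weighted_distance(x, y, (1, 1, 0))    # noise attribute ignored
```

With equal weights the distance is driven almost entirely by the noisy third attribute, whereas setting its weight to zero leaves only the contribution of the two informative attributes.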
The R package sharp (version ≥1.4.3) is available on CRAN at https://CRAN.R-project.org/package=sharp.