Statistical Significance of Clustering with Multidimensional Scaling.

Authors

Shen Hui, Bhamidi Shankar, Liu Yufeng

Affiliations

Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, U.S.A.

Department of Statistics and Operations Research, Department of Genetics, and Department of Biostatistics, Carolina Center for Genome Sciences, Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, U.S.A.

Publication information

J Comput Graph Stat. 2024;33(1):219-230. doi: 10.1080/10618600.2023.2219708. Epub 2023 Jul 20.

Abstract

Clustering is a fundamental tool for exploratory data analysis. One central problem in clustering is deciding if the clusters discovered by clustering methods are reliable as opposed to being artifacts of natural sampling variation. Statistical significance of clustering (SigClust) is a recently developed cluster evaluation tool for high-dimension, low-sample size data. Despite its successful application to many scientific problems, there are cases where the original SigClust may not work well. Furthermore, for specific applications, researchers may not have access to the original data and only have the dissimilarity matrix. In this case, clustering is still a valuable exploratory tool, but the original SigClust is not applicable. To address these issues, we propose a new SigClust method using multidimensional scaling (MDS). The underlying idea behind MDS-based SigClust is that one can achieve low-dimensional representations of the original data via MDS using only the dissimilarity matrix and then apply SigClust on the low-dimensional MDS space. The proposed MDS-based SigClust can circumvent the challenge of parameter estimation of the original method in high-dimensional spaces while keeping the essential clustering structure in the MDS space. Both simulations and real data applications demonstrate that the proposed method works remarkably well for assessing the statistical significance of clustering. Supplemental materials for the article are available online.
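The two-step idea described in the abstract (embed the dissimilarity matrix with MDS, then test for clustering in the low-dimensional space) can be illustrated with a rough sketch. This is not the authors' implementation. It assumes classical (Torgerson) MDS, a 2-means cluster index (within-cluster sum of squares divided by total sum of squares) as the test statistic, and a single Gaussian null whose variances are taken from the sample covariance of the embedding. The function names, the embedding dimension `k`, and the simulation count `n_sim` are hypothetical choices made for illustration only.

```python
import numpy as np


def classical_mds(D, k=2):
    """Embed an n x n dissimilarity matrix D into k dimensions (classical MDS)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared dissimilarities
    vals, vecs = np.linalg.eigh(B)               # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]             # indices of the k largest eigenvalues
    vals_k = np.clip(vals[idx], 0.0, None)       # guard against small negative values
    return vecs[:, idx] * np.sqrt(vals_k)


def two_means_ci(X, rng, n_init=5, n_iter=50):
    """2-means cluster index: within-cluster SS / total SS (smaller = stronger split)."""
    total = np.sum((X - X.mean(axis=0)) ** 2)
    best = np.inf
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=2, replace=False)]
        for _ in range(n_iter):
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            if labels.min() == labels.max():     # one cluster is empty, stop this run
                break
            centers = np.array([X[labels == j].mean(axis=0) for j in (0, 1)])
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d.min(axis=1).sum())
    return best / total


def mds_sigclust_pvalue(D, k=2, n_sim=200, seed=0):
    """Monte Carlo p-value for the 2-means cluster index in the MDS space.

    Assumption: the null is one Gaussian with diagonal covariance given by the
    eigenvalues of the embedding's sample covariance (a simplified stand-in).
    """
    rng = np.random.default_rng(seed)
    X = classical_mds(D, k)
    ci_obs = two_means_ci(X, rng)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    lam = np.clip(np.linalg.eigvalsh(cov), 1e-12, None)
    ci_null = np.array([
        two_means_ci(rng.normal(size=X.shape) * np.sqrt(lam), rng)
        for _ in range(n_sim)
    ])
    # Small p-value: the observed split is tighter than expected under the null.
    return (1 + np.sum(ci_null <= ci_obs)) / (1 + n_sim)
```

Only the dissimilarity matrix `D` is passed in, which mirrors the setting in which the original data are unavailable. A small p-value suggests that the two-cluster split is stronger than what a single Gaussian population would typically produce.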
