Limitations of Clustering with PCA and Correlated Noise.

Author information

Lippitt William, Carlson Nichole E, Arbet Jaron, Fingerlin Tasha E, Maier Lisa A, Kechris Katerina

Affiliations

Dept of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.

Dept of Immunology and Genomic Medicine, National Jewish Health, Denver, CO, USA.

Publication information

J Stat Comput Simul. 2024;94(10):2291-2319. doi: 10.1080/00949655.2024.2329976. Epub 2024 May 5.

Abstract

It is now common to have a modest to large number of features on individuals with complex diseases. Unsupervised analyses, such as clustering with and without preprocessing by Principal Component Analysis (PCA), are widely used in practice to uncover subgroups in a sample. However, in many modern studies features are often highly correlated and noisy (e.g. SNPs, -omics, quantitative imaging markers, and electronic health record data). The practical performance of clustering approaches in these settings remains unclear. Through extensive simulations and empirical examples applying Gaussian Mixture Models and related clustering methods, we show that these approaches (including variants of k-means, VarSelLCM, HDClassifier, and Fisher-EM) can perform very poorly in many settings. We also show that the poor performance is often driven by an explicit or implicit assumption of the clustering algorithm that high-variance features are relevant while lower-variance features are irrelevant, which we call the variance as relevance assumption. We develop practical pre-processing approaches that improve analysis performance in some cases. This work offers practical guidance on the strengths and limitations of unsupervised clustering approaches in modern data analysis applications.
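
The variance as relevance assumption can be illustrated with a small simulation. The sketch below is not the authors' simulation design: the three-feature layout, the noise scales, the sample size, and the helper names (leading_component, two_means_1d, accuracy) are all assumptions made for illustration. Two features carry high-variance noise that is correlated through a shared latent factor and is unrelated to the subgroups, while a third low-variance feature carries the subgroup signal. Clustering on the first principal component is then compared with clustering on the signal feature directly.

```python
import random


def leading_component(rows, iters=200):
    """Return the leading eigenvector of the sample covariance matrix
    (found by power iteration) and each row's score on that direction."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    v = [1.0] * p
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores


def two_means_1d(values, iters=100):
    """Lloyd's algorithm with two centers on one-dimensional data,
    initialized at the minimum and maximum values."""
    c0, c1 = min(values), max(values)
    labels = []
    for _ in range(iters):
        labels = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in values]
        g0 = [x for x, lab in zip(values, labels) if lab == 0]
        g1 = [x for x, lab in zip(values, labels) if lab == 1]
        if not g0 or not g1:
            break
        new0, new1 = sum(g0) / len(g0), sum(g1) / len(g1)
        if (new0, new1) == (c0, c1):
            break
        c0, c1 = new0, new1
    return labels


def accuracy(pred, truth):
    """Agreement with the true labels, allowing the two labels to be swapped."""
    agree = sum(a == b for a, b in zip(pred, truth)) / len(truth)
    return max(agree, 1 - agree)


random.seed(0)
truth = [i % 2 for i in range(400)]  # two equally sized subgroups (assumed)
rows = []
for t in truth:
    z = random.gauss(0, 1)  # shared latent factor -> correlated noise
    rows.append([
        5 * z + random.gauss(0, 0.5),                  # irrelevant, high variance
        5 * z + random.gauss(0, 0.5),                  # irrelevant, high variance
        (1.0 if t else -1.0) + random.gauss(0, 0.3),   # relevant, low variance
    ])

v, pc1_scores = leading_component(rows)
acc_pc1 = accuracy(two_means_1d(pc1_scores), truth)
acc_signal = accuracy(two_means_1d([r[2] for r in rows]), truth)

print("PC1 loadings:", [round(x, 3) for x in v])
print("accuracy, clustering on PC1:", round(acc_pc1, 3))
print("accuracy, clustering on the signal feature:", round(acc_signal, 3))
```

In this toy setup the first principal component is dominated by the two correlated noise features, so clustering on it should recover the subgroups at close to chance level, while the low-variance feature that PCA down-weights separates them almost perfectly. This is only one simple configuration; it does not cover the range of settings or methods the abstract refers to.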
