Limitations of Clustering with PCA and Correlated Noise

Author information

Lippitt William, Carlson Nichole E, Arbet Jaron, Fingerlin Tasha E, Maier Lisa A, Kechris Katerina

Affiliations

Dept of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.

Dept of Immunology and Genomic Medicine, National Jewish Health, Denver, CO, USA.

Publication information

J Stat Comput Simul. 2024;94(10):2291-2319. doi: 10.1080/00949655.2024.2329976. Epub 2024 May 5.

DOI: 10.1080/00949655.2024.2329976

PMID: 39176071

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11338589/

Abstract

It is now common to have a modest to large number of features on individuals with complex diseases. Unsupervised analyses, such as clustering with and without preprocessing by Principal Component Analysis (PCA), are widely used in practice to uncover subgroups in a sample. However, in many modern studies features are often highly correlated and noisy (e.g., SNPs, -omics, quantitative imaging markers, and electronic health record data). The practical performance of clustering approaches in these settings remains unclear. Through extensive simulations and empirical examples applying Gaussian Mixture Models and related clustering methods, we show these approaches (including variants of k-means, VarSelLCM, HDClassifier, and Fisher-EM) can have very poor performance in many settings. We also show the poor performance is often driven by either an explicit or implicit assumption by the clustering algorithm that high variance features are relevant while lower variance features are irrelevant, called the variance as relevance assumption. We develop practical pre-processing approaches that improve analysis performance in some cases. This work offers practical guidance on the strengths and limitations of unsupervised clustering approaches in modern data analysis applications.
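The variance as relevance assumption can be illustrated with a toy simulation. The sketch below is not the authors' simulation design: the sample size, effect size, noise scales, and the helper names (`pc_scores`, `kmeans2`, `accuracy`) are all assumptions chosen for illustration, and a plain NumPy implementation of Lloyd's k-means stands in for the methods evaluated in the paper. One low-variance feature carries the subgroup signal, while ten high-variance features share a single latent noise factor and carry none.

```python
import numpy as np

def pc_scores(X, n_components=1):
    """Project centered data onto its leading principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T

def kmeans2(X, rng, n_init=10, n_iter=100):
    """Lloyd's k-means with k=2; keeps the restart with the lowest inertia."""
    best_labels, best_inertia = None, np.inf
    for _ in range(n_init):
        # Start from two distinct observations chosen at random.
        centers = X[rng.choice(len(X), size=2, replace=False)]
        for _ in range(n_iter):
            dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            if len(np.unique(labels)) < 2:  # guard against an empty cluster
                break
            new_centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels, inertia = dist.argmin(axis=1), dist.min(axis=1).sum()
        if inertia < best_inertia:
            best_labels, best_inertia = labels, inertia
    return best_labels

def accuracy(labels, truth):
    """Agreement with the true subgroups, allowing for label switching."""
    agree = np.mean(labels == truth)
    return max(agree, 1 - agree)

rng = np.random.default_rng(0)
n, p_noise = 400, 10                      # assumed sizes, for illustration only
truth = np.repeat([0, 1], n // 2)

# Relevant feature: subgroup means of -1 and +1, small within-group spread.
signal = np.where(truth == 0, -1.0, 1.0) + rng.normal(0, 0.3, n)

# Irrelevant features: one shared latent factor makes them correlated
# and gives them far more variance than the relevant feature.
latent = rng.normal(0, 3.0, n)
noise = latent[:, None] + rng.normal(0, 1.0, (n, p_noise))

X = np.column_stack([signal, noise])

acc_signal = accuracy(kmeans2(X[:, [0]], rng), truth)     # relevant feature only
acc_raw = accuracy(kmeans2(X, rng), truth)                # all features, no PCA
acc_pca = accuracy(kmeans2(pc_scores(X, 1), rng), truth)  # first PC only

print(f"signal feature only: {acc_signal:.2f}")
print(f"all raw features:    {acc_raw:.2f}")
print(f"first PC only:       {acc_pca:.2f}")
```

In this setup the first principal component aligns with the shared noise factor, so clustering on it, or on the raw features, should recover the subgroups at close to chance level, while clustering on the single relevant feature should recover them almost perfectly. The pre-processing approaches the authors propose are not described in the abstract and are not reproduced here.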
