Lippitt William, Carlson Nichole E, Arbet Jaron, Fingerlin Tasha E, Maier Lisa A, Kechris Katerina
Dept of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
Dept of Immunology and Genomic Medicine, National Jewish Health, Denver, CO, USA.
J Stat Comput Simul. 2024;94(10):2291-2319. doi: 10.1080/00949655.2024.2329976. Epub 2024 May 5.
It is now common to have a modest to large number of features on individuals with complex diseases. Unsupervised analyses, such as clustering with and without preprocessing by Principal Component Analysis (PCA), are widely used in practice to uncover subgroups in a sample. However, in many modern studies features are often highly correlated and noisy (e.g. SNPs, -omics, quantitative imaging markers, and electronic health record data). The practical performance of clustering approaches in these settings remains unclear. Through extensive simulations and empirical examples applying Gaussian Mixture Models and related clustering methods, we show that these approaches (including variants of k-means, VarSelLCM, HDClassifier, and Fisher-EM) can perform very poorly in many settings. We also show that the poor performance is often driven by an explicit or implicit assumption of the clustering algorithm that high-variance features are relevant while lower-variance features are irrelevant, which we call the variance as relevance assumption. We develop practical preprocessing approaches that improve analysis performance in some cases. This work offers practical guidance on the strengths and limitations of unsupervised clustering approaches in modern data analysis applications.
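The variance as relevance assumption can be illustrated with a toy simulation. The sketch below is not taken from the paper: the data-generating settings, the helper names (`kmeans2`, `standardize`, `two_group_accuracy`), and the use of per-feature standardization as the preprocessing step are all assumptions made for illustration, and standardization stands in for, rather than reproduces, the authors' preprocessing approaches. It places the true subgroup signal in a low-variance feature, adds a high-variance noise feature, and runs a plain two-cluster k-means before and after rescaling.

```python
import random
import statistics

def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(statistics.fmean(col) for col in zip(*members))

def kmeans2(points, n_starts=10, n_iter=100, seed=0):
    """Lloyd's algorithm with k=2; returns labels from the lowest-inertia restart."""
    rng = random.Random(seed)
    best_inertia, best_labels = float("inf"), None
    for _ in range(n_starts):
        centers = rng.sample(points, 2)  # two distinct observations as initial centers
        for _ in range(n_iter):
            labels = [0 if sqdist(p, centers[0]) <= sqdist(p, centers[1]) else 1
                      for p in points]
            updated = []
            for c in (0, 1):
                members = [p for p, lab in zip(points, labels) if lab == c]
                updated.append(centroid(members) if members else centers[c])
            if updated == centers:  # assignments are stable, so stop
                break
            centers = updated
        inertia = sum(sqdist(p, centers[lab]) for p, lab in zip(points, labels))
        if inertia < best_inertia:
            best_inertia, best_labels = inertia, labels
    return best_labels

def standardize(points):
    """Rescale every feature to mean 0 and variance 1 (illustrative preprocessing)."""
    cols = list(zip(*points))
    mus = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(p, mus, sds)) for p in points]

def two_group_accuracy(labels, truth):
    """Agreement with the true grouping, allowing the two labels to be swapped."""
    agree = sum(l == t for l, t in zip(labels, truth)) / len(truth)
    return max(agree, 1 - agree)

# Hypothetical data: feature 1 is low variance but separates the two subgroups;
# feature 2 is high variance and carries no subgroup information.
data_rng = random.Random(1)
truth = [0] * 200 + [1] * 200
points = [(data_rng.gauss(-1 if t == 0 else 1, 0.2), data_rng.gauss(0, 10))
          for t in truth]

acc_raw = two_group_accuracy(kmeans2(points), truth)
acc_scaled = two_group_accuracy(kmeans2(standardize(points)), truth)
print(f"accuracy on raw features:          {acc_raw:.2f}")
print(f"accuracy on standardized features: {acc_scaled:.2f}")
```

Under these assumed settings, k-means on the raw features tends to split the sample along the noisy high-variance feature and recovers the subgroups little better than chance, while the rescaled version recovers them well. Rescaling helps here only because the two features are independent; with highly correlated features, as described in the abstract, a simple fix of this kind may not be enough.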