Tilburg University, Methods and Statistics, Tilburg, The Netherlands.
KU Leuven, Psychology and Educational Sciences, Leuven, Belgium.
Behav Res Methods. 2024 Mar;56(3):1413-1432. doi: 10.3758/s13428-023-02099-0. Epub 2023 Aug 1.
Principal component analysis (PCA) is an important tool for analyzing large collections of variables. It functions both as a pre-processing tool that summarizes many variables into components and as a method to reveal structure in data. Different coefficients play the central role in these two uses: when the goal is summarization, one focuses on the weights, whereas when the goal is to reveal structure, one inspects the loadings. It is well known that the solutions to the two approaches can be found by singular value decomposition; weights, loadings, and right singular vectors are mathematically equivalent. What is often overlooked is that they are no longer equivalent in the setting of sparse PCA methods, which induce zeros either in the weights or in the loadings. The lack of awareness of this difference has led to questionable research practices in sparse PCA. First, in simulation studies, data are mostly generated only from structures with sparse singular vectors or sparse loadings, neglecting structures with sparse weights. Second, reported results represent local optima because the iterative routines are often initiated with the right singular vectors. In this paper, we critically re-assess sparse PCA methods by also including data-generating schemes characterized by sparse weights, as well as different initialization strategies. The results show that relying on commonly used data-generating models can lead to over-optimistic conclusions. They also highlight the impact of the choice between sparse-weights and sparse-loadings methods and of the initialization strategy. The practical consequences of this choice are illustrated with empirical datasets.
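The equivalence stated in the abstract, and its breakdown under sparsity, can be checked numerically. The following is a minimal sketch, not the paper's procedure: the toy data, the number of components `k`, and the crude "keep the three largest entries per component" rule are all assumptions made for illustration, and the hard truncation merely stands in for a real sparse PCA method. Loadings are taken here to be the least-squares coefficients that reconstruct the data from the component scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): 100 observations of 6 correlated variables, column-centered.
X = rng.standard_normal((100, 6)) @ rng.standard_normal((6, 6))
X -= X.mean(axis=0)

k = 2  # number of components (assumption)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:k].T  # first k right singular vectors, shape (6, k)


def ls_loadings(X, W):
    """Loadings P that minimize ||X - (X W) P'||, given weights W."""
    T = X @ W  # component scores
    return np.linalg.lstsq(T, X, rcond=None)[0].T


# Ordinary PCA: weights = right singular vectors, and the loadings coincide.
W = V
P = ls_loadings(X, W)
weights_equal_loadings = np.allclose(W, P)

# Crude sparsification (illustration only, not a real sparse PCA method):
# zero all but the 3 largest-magnitude weights in each component, then rescale.
smallest = np.argsort(np.abs(W), axis=0)[:-3]  # row indices of the 3 smallest per column
W_sparse = W.copy()
np.put_along_axis(W_sparse, smallest, 0.0, axis=0)
W_sparse /= np.linalg.norm(W_sparse, axis=0)

# With sparse weights, the best-fitting loadings are no longer the weights.
P_sparse = ls_loadings(X, W_sparse)
sparse_weights_equal_loadings = np.allclose(W_sparse, P_sparse)

print("dense:  weights == loadings?", weights_equal_loadings)
print("sparse: weights == loadings?", sparse_weights_equal_loadings)
print("zeros in sparse weights:", int((W_sparse == 0).sum()))
print("zeros in matching loadings:", int((P_sparse == 0).sum()))
```

In the dense case the scores are `X @ V`, so regressing `X` on them returns `V` itself. Once zeros are forced into the weights, the loadings that best reconstruct `X` are generally dense, which is why the choice of which coefficient matrix to sparsify matters.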