Frost H Robert
Department of Biomedical Data Science, Dartmouth College.
J Comput Graph Stat. 2022;31(2):486-501. doi: 10.1080/10618600.2021.1987254. Epub 2021 Nov 12.
We present a novel technique for sparse principal component analysis. This method, named Eigenvectors from Eigenvalues Sparse Principal Component Analysis (EESPCA), is based on the formula for computing squared eigenvector loadings of a Hermitian matrix from the eigenvalues of the full matrix and associated sub-matrices. We explore two versions of the EESPCA method: a version that uses a fixed threshold for inducing sparsity and a version that selects the threshold via cross-validation. Relative to the state-of-the-art sparse PCA methods of Witten et al., Yuan & Zhang and Tan et al., the fixed threshold EESPCA technique offers an order-of-magnitude improvement in computational speed, does not require estimation of tuning parameters via cross-validation, and can more accurately identify true zero principal component loadings across a range of data matrix sizes and covariance structures. Importantly, the EESPCA method achieves these benefits while maintaining out-of-sample reconstruction error and PC estimation error close to the lowest error generated by all evaluated approaches. EESPCA is a practical and effective technique for sparse PCA with particular relevance to computationally demanding statistical problems such as the analysis of high-dimensional data sets or application of statistical techniques like resampling that involve the repeated calculation of sparse PCs.
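The identity the abstract refers to can be checked numerically. For a Hermitian matrix A with eigenvalues lambda_1..lambda_n, and with M_j denoting A after row j and column j are deleted, the squared j-th entry of the i-th unit eigenvector equals the product of (lambda_i(A) - lambda_k(M_j)) over all n-1 eigenvalues of M_j, divided by the product of (lambda_i(A) - lambda_k(A)) over k != i. The following is a minimal sketch of that identity only, not the EESPCA algorithm itself (its approximation and thresholding steps are not described in this abstract). The function name `squared_loadings_from_eigenvalues` and the simulated data are hypothetical, and the sketch assumes a real symmetric matrix whose i-th eigenvalue is simple.

```python
import numpy as np

def squared_loadings_from_eigenvalues(A, i):
    """Squared entries of the i-th unit eigenvector of a symmetric matrix A.

    Uses only eigenvalues of A and of its principal sub-matrices.
    Eigenvalues are indexed in ascending order (the eigvalsh convention).
    Assumes the i-th eigenvalue is simple, otherwise the denominator is zero.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    i = i % n  # allow negative indices such as -1 for the largest eigenvalue
    lam = np.linalg.eigvalsh(A)

    # Denominator: product over k != i of (lam_i - lam_k).
    denom = np.prod(np.delete(lam[i] - lam, i))

    sq = np.empty(n)
    for j in range(n):
        # Principal sub-matrix M_j: drop row j and column j.
        M_j = np.delete(np.delete(A, j, axis=0), j, axis=1)
        mu = np.linalg.eigvalsh(M_j)
        # Numerator: product over all n-1 eigenvalues of M_j.
        sq[j] = np.prod(lam[i] - mu) / denom
    return sq

# Illustrative data: sample covariance of simulated observations.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
S = np.cov(X, rowvar=False)

# Squared loadings of the first principal component (largest eigenvalue).
sq = squared_loadings_from_eigenvalues(S, i=-1)

# Reference values from a direct eigendecomposition.
v_top = np.linalg.eigh(S)[1][:, -1]
```

Here `sq` should agree with `v_top**2` up to floating-point error. The identity recovers only squared magnitudes, so the signs of the loadings have to come from elsewhere. This direct form also needs one sub-matrix eigendecomposition per variable, so by itself it is not a fast route to the loadings.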