How to reduce dimension with PCA and random projections?

Author information

Yang Fan, Liu Sifan, Dobriban Edgar, Woodruff David P

Affiliations

Wharton Statistics Department, University of Pennsylvania, Philadelphia, PA 19104, USA.

Department of Statistics, Stanford University, Stanford, CA 94305, USA.

Publication information

IEEE Trans Inf Theory. 2021 Dec;67(12):8154-8189. doi: 10.1109/tit.2021.3112821. Epub 2021 Sep 14.

DOI: 10.1109/tit.2021.3112821

PMID: 35695837

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9173709/

Abstract

In our "big data" age, the size and complexity of data are steadily increasing. Methods for dimension reduction are ever more popular and useful. Two distinct types of dimension reduction are "data-oblivious" methods such as random projections and sketching, and "data-aware" methods such as principal component analysis (PCA). Both have their strengths, such as speed for random projections, and data-adaptivity for PCA. In this work, we study how to combine them to get the best of both. We study "sketch and solve" methods that take a random projection (or sketch) first, and compute PCA after. We compute the performance of several popular sketching methods (random iid projections, random sampling, subsampled Hadamard transform, CountSketch, etc.) in a general "signal-plus-noise" (or spiked) data model. Compared to well-known works, our results (1) give asymptotically exact results, and (2) apply when the signal components are only slightly above the noise, but the projection dimension is non-negligible. We also study stronger signals allowing more general covariance structures. We find that (a) signal strength decreases under projection in a delicate way depending on the structure of the data and the sketching method, (b) orthogonal projections are slightly more accurate, (c) randomization does not hurt too much, due to concentration of measure, (d) CountSketch can be somewhat improved by a normalization method. Our results have implications for statistical learning and data analysis. We also illustrate that the results are highly accurate in simulations and in analyzing empirical data.
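
The "sketch and solve" idea described in the abstract (sketch first, compute PCA after) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the rank-one spiked model with signal strength `theta`, the dimensions `n`, `p`, `r`, the choice to sketch along the sample axis, and the helper names `gaussian_sketch`, `count_sketch`, and `top_direction`. The CountSketch shown is the plain version, without the normalization the abstract mentions.

```python
import numpy as np

def gaussian_sketch(X, r, rng):
    """Random iid projection: S is r x n with N(0, 1/r) entries (assumed scaling)."""
    n = X.shape[0]
    S = rng.standard_normal((r, n)) / np.sqrt(r)
    return S @ X

def count_sketch(X, r, rng):
    """Plain CountSketch: each row of X is added, with a random sign, to one random bucket."""
    n, p = X.shape
    buckets = rng.integers(0, r, size=n)        # target row for each sample
    signs = rng.choice([-1.0, 1.0], size=n)     # random sign for each sample
    out = np.zeros((r, p))
    np.add.at(out, buckets, signs[:, None] * X)  # unbuffered accumulation per bucket
    return out

def top_direction(M):
    """Leading right singular vector of M, i.e. the first principal direction (no centering)."""
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[0]

rng = np.random.default_rng(0)
n, p, r = 2000, 50, 400   # samples, features, sketch size (illustrative values)
theta = 5.0               # hypothetical signal strength of the single spike

# Rank-one signal-plus-noise data: X = sqrt(theta) * z v^T + noise
v = np.zeros(p)
v[0] = 1.0
z = rng.standard_normal(n)
X = np.sqrt(theta) * np.outer(z, v) + rng.standard_normal((n, p))

# Overlap |<v_hat, v>| between the estimated and the true direction
overlaps = {
    "full PCA": abs(top_direction(X) @ v),
    "Gaussian sketch": abs(top_direction(gaussian_sketch(X, r, rng)) @ v),
    "CountSketch": abs(top_direction(count_sketch(X, r, rng)) @ v),
}
for name, value in overlaps.items():
    print(f"{name}: overlap = {value:.3f}")
```

Each sketch reduces the 2000 x 50 data matrix to 400 x 50 before the SVD, so the overlap with the true direction gives a rough empirical view of how much accuracy the sketching step costs in this toy setting.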

