Suppr超能文献

如何使用主成分分析(PCA)和随机投影进行降维?

How to reduce dimension with PCA and random projections?

作者信息

Yang Fan, Liu Sifan, Dobriban Edgar, Woodruff David P

机构信息

Wharton Statistics Department, University of Pennsylvania, Philadelphia, PA 19104, USA.

Department of Statistics, Stanford University, Stanford, CA 94305, USA.

出版信息

IEEE Trans Inf Theory. 2021 Dec;67(12):8154-8189. doi: 10.1109/tit.2021.3112821. Epub 2021 Sep 14.

Abstract

In our "big data" age, the size and complexity of data is steadily increasing. Methods for dimension reduction are ever more popular and useful. Two distinct types of dimension reduction are "data-oblivious" methods such as random projections and sketching, and "data-aware" methods such as principal component analysis (PCA). Both have their strengths, such as speed for random projections, and data-adaptivity for PCA. In this work, we study how to combine them to get the best of both. We study "sketch and solve" methods that take a random projection (or sketch) first, and compute PCA after. We compute the performance of several popular sketching methods (random iid projections, random sampling, subsampled Hadamard transform, CountSketch, etc) in a general "signal-plus-noise" (or spiked) data model. Compared to well-known works, our results (1) give asymptotically exact results, and (2) apply when the signal components are only slightly above the noise, but the projection dimension is non-negligible. We also study stronger signals allowing more general covariance structures. We find that (a) signal strength decreases under projection in a delicate way depending on the structure of the data and the sketching method, (b) orthogonal projections are slightly more accurate, (c) randomization does not hurt too much, due to concentration of measure, (d) CountSketch can be somewhat improved by a normalization method. Our results have implications for statistical learning and data analysis. We also illustrate that the results are highly accurate in simulations and in analyzing empirical data.

摘要

在我们这个“大数据”时代,数据的规模和复杂性正在稳步增加。降维方法越来越受欢迎且实用。两种不同类型的降维方法是“数据无关”方法,如随机投影和草图绘制,以及“数据感知”方法,如主成分分析(PCA)。两者都有各自的优势,比如随机投影的速度,以及PCA的数据适应性。在这项工作中,我们研究如何将它们结合起来以充分发挥两者的优势。我们研究“草图并求解”方法,即先进行随机投影(或草图绘制),然后计算PCA。我们在一般的“信号加噪声”(或尖峰)数据模型中计算了几种流行的草图绘制方法(随机独立同分布投影、随机采样、子采样哈达玛变换、CountSketch等)的性能。与知名研究相比,我们的结果(1)给出了渐近精确的结果,(2)适用于信号分量仅略高于噪声但投影维度不可忽略的情况。我们还研究了允许更一般协方差结构的更强信号。我们发现:(a)根据数据结构和草图绘制方法,信号强度在投影下以微妙的方式降低;(b)正交投影略更精确;(c)由于测度集中,随机化不会造成太大损害;(d)CountSketch可以通过一种归一化方法得到一定程度的改进。我们的结果对统计学习和数据分析有影响。我们还表明,这些结果在模拟和分析实证数据时非常准确。

相似文献

1
How to reduce dimension with PCA and random projections?如何使用主成分分析(PCA)和随机投影进行降维?
IEEE Trans Inf Theory. 2021 Dec;67(12):8154-8189. doi: 10.1109/tit.2021.3112821. Epub 2021 Sep 14.
2
Statistical properties of sketching algorithms.草图绘制算法的统计特性。
Biometrika. 2021 Jun;108(2):283-297. doi: 10.1093/biomet/asaa062. Epub 2020 Jul 30.
3
On randomized sketching algorithms and the Tracy-Widom law.关于随机草图算法与 Tracy-Widom 定律
Stat Comput. 2023;33(1):34. doi: 10.1007/s11222-022-10148-5. Epub 2023 Jan 19.
4
Projection pursuit in high dimensions.高维中的投影寻踪。
Proc Natl Acad Sci U S A. 2018 Sep 11;115(37):9151-9156. doi: 10.1073/pnas.1801177115. Epub 2018 Aug 27.
10
Fast GRAPPA reconstruction with random projection.基于随机投影的快速GRAPPA重建
Magn Reson Med. 2015 Jul;74(1):71-80. doi: 10.1002/mrm.25373. Epub 2014 Jul 17.

本文引用的文献

1
PCA in High Dimensions: An orientation.高维主成分分析:一种导向
Proc IEEE Inst Electr Electron Eng. 2018 Aug;106(8):1277-1292. doi: 10.1109/JPROC.2018.2846730. Epub 2018 Jul 18.
2
Optimal Shrinkage of Eigenvalues in the Spiked Covariance Model.尖峰协方差模型中特征值的最优收缩
Ann Stat. 2018 Aug;46(4):1742-1778. doi: 10.1214/17-AOS1601. Epub 2018 Jun 27.
5
Randomized algorithms for the low-rank approximation of matrices.矩阵低秩逼近的随机算法。
Proc Natl Acad Sci U S A. 2007 Dec 18;104(51):20167-72. doi: 10.1073/pnas.0709640104. Epub 2007 Dec 4.
6
A human genome diversity cell line panel.一个人类基因组多样性细胞系面板。
Science. 2002 Apr 12;296(5566):261-2. doi: 10.1126/science.296.5566.261b.

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验