Yang Fan, Liu Sifan, Dobriban Edgar, Woodruff David P
Wharton Statistics Department, University of Pennsylvania, Philadelphia, PA 19104, USA.
Department of Statistics, Stanford University, Stanford, CA 94305, USA.
IEEE Trans Inf Theory. 2021 Dec;67(12):8154-8189. doi: 10.1109/tit.2021.3112821. Epub 2021 Sep 14.
In our "big data" age, the size and complexity of data is steadily increasing. Methods for dimension reduction are ever more popular and useful. Two distinct types of dimension reduction are "data-oblivious" methods such as random projections and sketching, and "data-aware" methods such as principal component analysis (PCA). Both have their strengths, such as speed for random projections, and data-adaptivity for PCA. In this work, we study how to combine them to get the best of both. We study "sketch and solve" methods that take a random projection (or sketch) first, and compute PCA after. We compute the performance of several popular sketching methods (random iid projections, random sampling, subsampled Hadamard transform, CountSketch, etc) in a general "signal-plus-noise" (or spiked) data model. Compared to well-known works, our results (1) give asymptotically exact results, and (2) apply when the signal components are only slightly above the noise, but the projection dimension is non-negligible. We also study stronger signals allowing more general covariance structures. We find that (a) signal strength decreases under projection in a delicate way depending on the structure of the data and the sketching method, (b) orthogonal projections are slightly more accurate, (c) randomization does not hurt too much, due to concentration of measure, (d) CountSketch can be somewhat improved by a normalization method. Our results have implications for statistical learning and data analysis. We also illustrate that the results are highly accurate in simulations and in analyzing empirical data.
在我们这个“大数据”时代,数据的规模和复杂性正在稳步增加。降维方法越来越受欢迎且实用。两种不同类型的降维方法是“数据无关”方法,如随机投影和草图绘制,以及“数据感知”方法,如主成分分析(PCA)。两者都有各自的优势,比如随机投影的速度,以及PCA的数据适应性。在这项工作中,我们研究如何将它们结合起来以充分发挥两者的优势。我们研究“草图并求解”方法,即先进行随机投影(或草图绘制),然后计算PCA。我们在一般的“信号加噪声”(或尖峰)数据模型中计算了几种流行的草图绘制方法(随机独立同分布投影、随机采样、子采样哈达玛变换、CountSketch等)的性能。与知名研究相比,我们的结果(1)给出了渐近精确的结果,(2)适用于信号分量仅略高于噪声但投影维度不可忽略的情况。我们还研究了允许更一般协方差结构的更强信号。我们发现:(a)根据数据结构和草图绘制方法,信号强度在投影下以微妙的方式降低;(b)正交投影略更精确;(c)由于测度集中,随机化不会造成太大损害;(d)CountSketch可以通过一种归一化方法得到一定程度的改进。我们的结果对统计学习和数据分析有影响。我们还表明,这些结果在模拟和分析实证数据时非常准确。