Zhu Ziwei, Wang Tengyao, Samworth Richard J
Statistical Laboratory, University of Cambridge, Cambridge, UK.
Department of Statistics, University of Michigan, Ann Arbor, Michigan, USA.
J R Stat Soc Series B Stat Methodol. 2022 Nov;84(5):2000-2031. doi: 10.1111/rssb.12550. Epub 2022 Nov 20.
We study the problem of high-dimensional Principal Component Analysis (PCA) with missing observations. In a simple, homogeneous observation model, we show that an existing observed-proportion weighted (OPW) estimator of the leading principal components can (nearly) attain the minimax optimal rate of convergence, which exhibits an interesting phase transition. However, deeper investigation reveals that, particularly in more realistic settings where the observation probabilities are heterogeneous, the empirical performance of the OPW estimator can be unsatisfactory; moreover, in the noiseless case, it fails to provide exact recovery of the principal components. Our main contribution, then, is to introduce a new method, which we call primePCA, that is designed to cope with situations where observations may be missing in a heterogeneous manner. Starting from the OPW estimator, primePCA iteratively projects the observed entries of the data matrix onto the column space of our current estimate to impute the missing entries, and then updates our estimate by computing the leading right singular space of the imputed data matrix. We prove that the error of primePCA converges to zero at a geometric rate in the noiseless case, and when the signal strength is not too small. An important feature of our theoretical guarantees is that they depend on average, as opposed to worst-case, properties of the missingness mechanism. Our numerical studies on both simulated and real data reveal that primePCA exhibits very encouraging performance across a wide range of scenarios, including settings where the data are not Missing Completely At Random.
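The iteration described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`opw_init`, `refine_once`, `sin_theta_loss`, `demo`), the simulation settings, and the exact form of the observed-proportion weighting are assumptions made for illustration, and any refinements in the published algorithm beyond what the abstract states are omitted. The sketch starts from a weighted second-moment estimate, fits each row's observed entries by least squares on the current basis to impute the missing entries, and then takes the leading right singular vectors of the imputed matrix.

```python
import numpy as np


def opw_init(Y, mask, K):
    """Top-K eigenvectors of a second-moment matrix reweighted by observed proportions."""
    Y0 = np.where(mask, Y, 0.0)              # zero-fill missing entries
    M = mask.astype(float)
    counts = M.T @ M                         # rows in which columns j and k are both observed
    sums = Y0.T @ Y0                         # sums of products over jointly observed rows
    G = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    _, eigvecs = np.linalg.eigh(G)           # eigenvalues come back in ascending order
    return eigvecs[:, ::-1][:, :K]


def refine_once(Y, mask, V):
    """One impute-then-SVD update of the d x K basis estimate V."""
    n, _ = Y.shape
    K = V.shape[1]
    Y_hat = np.where(mask, Y, 0.0)
    keep = np.zeros(n, dtype=bool)
    for i in range(n):
        obs = mask[i]
        if obs.sum() < K:                    # assumption: skip rows too sparse to fit K scores
            continue
        # Least-squares fit of the observed entries on the matching rows of V.
        scores, *_ = np.linalg.lstsq(V[obs], Y[i, obs], rcond=None)
        Y_hat[i, ~obs] = V[~obs] @ scores    # impute the missing entries
        keep[i] = True
    _, _, Vt = np.linalg.svd(Y_hat[keep], full_matrices=False)
    return Vt[:K].T                          # leading right singular vectors


def sin_theta_loss(V_hat, V_true):
    """Frobenius norm of the part of V_hat lying outside the span of V_true."""
    return np.linalg.norm(V_hat - V_true @ (V_true.T @ V_hat))


def demo(n=400, d=30, K=2, n_iter=60, seed=0):
    """Noiseless rank-K data with column-dependent observation probabilities."""
    rng = np.random.default_rng(seed)
    V_true, _ = np.linalg.qr(rng.standard_normal((d, K)))
    scores = rng.standard_normal((n, K)) * np.linspace(3.0, 2.0, K)
    Y = scores @ V_true.T
    probs = np.linspace(0.5, 0.95, d)        # heterogeneous missingness across columns
    mask = rng.random((n, d)) < probs        # True where an entry is observed
    V_hat = opw_init(Y, mask, K)
    err_init = sin_theta_loss(V_hat, V_true)
    for _ in range(n_iter):
        V_hat = refine_once(Y, mask, V_hat)
    return err_init, sin_theta_loss(V_hat, V_true)


if __name__ == "__main__":
    err_init, err_final = demo()
    print(f"initial loss: {err_init:.3e}, loss after refinement: {err_final:.3e}")
```

Calling `demo()` returns the subspace loss of the initial estimate and the loss after the refinement iterations, so the two can be compared directly in the noiseless, heterogeneous setting the abstract highlights.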