Landa Boris, Zhang Thomas T C K, Kluger Yuval
Program in Applied Mathematics, Yale University.
Department of Electrical and Systems Engineering, University of Pennsylvania.
SIAM J Math Data Sci. 2022;4(4):1420-1446. doi: 10.1137/21m1456807.
Estimating the rank of a corrupted data matrix is an important task in data analysis, most notably for choosing the number of components in PCA. Significant progress on this task was achieved using random matrix theory by characterizing the spectral properties of large noise matrices. However, utilizing such tools is not straightforward when the data matrix consists of count random variables, e.g., Poisson, in which case the noise can be heteroskedastic with an unknown variance in each entry. In this work, we focus on a Poisson random matrix with independent entries and propose a simple procedure, termed , for estimating the rank of the underlying signal matrix (i.e., the Poisson parameter matrix) without any prior knowledge. Our approach is based on the key observation that one can scale the rows and columns of the data matrix simultaneously so that the spectrum of the corresponding noise agrees with the standard Marchenko-Pastur (MP) law, justifying the use of the MP upper edge as a threshold for rank selection. Importantly, the required scaling factors can be estimated directly from the observations by solving a matrix scaling problem via the Sinkhorn-Knopp algorithm. Aside from the Poisson, our approach is extended to families of distributions that satisfy a quadratic relation between the mean and the variance, such as the generalized Poisson, binomial, negative binomial, gamma, and many others. This quadratic relation can also account for missing entries in the data. We conduct numerical experiments that corroborate our theoretical findings, and showcase the advantage of our approach for rank estimation in challenging regimes. Furthermore, we demonstrate the favorable performance of our approach on several real datasets of single-cell RNA sequencing (scRNA-seq), High-Throughput Chromosome Conformation Capture (Hi-C), and document topic modeling.
估计一个被损坏的数据矩阵的秩是数据分析中的一项重要任务,在主成分分析(PCA)中选择成分数量时尤为显著。利用随机矩阵理论,通过刻画大噪声矩阵的谱特性,在这项任务上取得了显著进展。然而,当数据矩阵由计数随机变量(如泊松分布)组成时,使用这些工具并非易事,在这种情况下,噪声可能是异方差的,且每个元素的方差未知。在这项工作中,我们专注于具有独立元素的泊松随机矩阵,并提出了一种简单的程序,称为 ,用于在没有任何先验知识的情况下估计底层信号矩阵(即泊松参数矩阵)的秩。我们的方法基于一个关键观察结果,即可以同时对数据矩阵的行和列进行缩放,使得相应噪声的谱与标准的马尔琴科 - 帕斯图尔(MP)定律一致,这证明了使用MP上边缘作为秩选择的阈值是合理的。重要的是,所需的缩放因子可以通过Sinkhorn - Knopp算法解决矩阵缩放问题,直接从观测值中估计出来。除了泊松分布,我们的方法还扩展到了均值和方差之间满足二次关系的分布族,如广义泊松分布、二项分布、负二项分布、伽马分布等。这种二次关系也可以解释数据中的缺失元素。我们进行了数值实验,证实了我们的理论发现,并展示了我们的方法在具有挑战性的情况下进行秩估计的优势。此外,我们在几个单细胞RNA测序(scRNA - seq)、高通量染色体构象捕获(Hi - C)的真实数据集以及文档主题建模上展示了我们方法的良好性能。