Biwhitening Reveals the Rank of a Count Matrix.

Author information

Landa Boris, Zhang Thomas T C K, Kluger Yuval

Affiliations

Program in Applied Mathematics, Yale University.

Department of Electrical and Systems Engineering, University of Pennsylvania.

Publication information

SIAM J Math Data Sci. 2022;4(4):1420-1446. doi: 10.1137/21m1456807.

DOI:10.1137/21m1456807

PMID:37576699

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10417917/

Abstract

Estimating the rank of a corrupted data matrix is an important task in data analysis, most notably for choosing the number of components in PCA. Significant progress on this task was achieved using random matrix theory by characterizing the spectral properties of large noise matrices. However, utilizing such tools is not straightforward when the data matrix consists of count random variables, e.g., Poisson, in which case the noise can be heteroskedastic with an unknown variance in each entry. In this work, we focus on a Poisson random matrix with independent entries and propose a simple procedure, termed biwhitening, for estimating the rank of the underlying signal matrix (i.e., the Poisson parameter matrix) without any prior knowledge. Our approach is based on the key observation that one can scale the rows and columns of the data matrix simultaneously so that the spectrum of the corresponding noise agrees with the standard Marchenko-Pastur (MP) law, justifying the use of the MP upper edge as a threshold for rank selection. Importantly, the required scaling factors can be estimated directly from the observations by solving a matrix scaling problem via the Sinkhorn-Knopp algorithm. Aside from the Poisson, our approach is extended to families of distributions that satisfy a quadratic relation between the mean and the variance, such as the generalized Poisson, binomial, negative binomial, gamma, and many others. This quadratic relation can also account for missing entries in the data. We conduct numerical experiments that corroborate our theoretical findings, and showcase the advantage of our approach for rank estimation in challenging regimes. Furthermore, we demonstrate the favorable performance of our approach on several real datasets of single-cell RNA sequencing (scRNA-seq), High-Throughput Chromosome Conformation Capture (Hi-C), and document topic modeling.
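
The procedure described in the abstract can be illustrated with a minimal NumPy sketch for the Poisson case. It is not the authors' reference implementation. It assumes that the scaling factors are chosen so that the scaled count matrix has row sums equal to the number of columns n and column sums equal to the number of rows m (so the average noise variance in every row and column is about one), that the data are then scaled by the square roots of those factors, and that the MP upper edge corresponds to a singular-value threshold of sqrt(m) + sqrt(n). The function names `sinkhorn_knopp` and `biwhiten_rank`, the iteration limits, and the synthetic test matrix are all hypothetical choices made for illustration.

```python
import numpy as np

def sinkhorn_knopp(Y, row_target, col_target, max_iter=1000, tol=1e-9):
    """Find positive u, v so that diag(u) @ Y @ diag(v) has the requested
    row and column sums. Assumes Y is nonnegative with no all-zero row or column."""
    m, n = Y.shape
    u, v = np.ones(m), np.ones(n)
    for _ in range(max_iter):
        u = row_target / (Y @ v)       # match the row sums
        v = col_target / (Y.T @ u)     # match the column sums
        # Column sums are exact after the v update, so only rows need checking.
        row_err = np.max(np.abs(u * (Y @ v) - row_target)) / row_target
        if row_err < tol:
            break
    return u, v

def biwhiten_rank(Y):
    """Estimate the rank of the Poisson parameter matrix behind the counts Y."""
    Y = np.asarray(Y, dtype=float)
    m, n = Y.shape
    # For Poisson data the variance equals the mean, so the counts themselves
    # serve as variance estimates in the matrix scaling problem (assumption
    # of this sketch: row sums n, column sums m).
    u, v = sinkhorn_knopp(Y, row_target=n, col_target=m)
    # Scale the data by the square roots of the variance scaling factors.
    Y_hat = np.sqrt(u)[:, None] * Y * np.sqrt(v)[None, :]
    s = np.linalg.svd(Y_hat, compute_uv=False)
    # MP upper edge, expressed for the singular values of the unnormalized matrix.
    threshold = np.sqrt(m) + np.sqrt(n)
    return int(np.sum(s > threshold)), s, threshold, u, v

# Synthetic example: Poisson counts with a rank-3 parameter matrix.
rng = np.random.default_rng(0)
m, n, r = 200, 300, 3
X = 50.0 * rng.uniform(size=(m, r)) @ rng.uniform(size=(r, n))
Y = rng.poisson(X)

rank_hat, s, threshold, u, v = biwhiten_rank(Y)
print("estimated rank:", rank_hat)
print("leading singular values:", s[:5].round(1), "threshold:", round(threshold, 1))
```

In this synthetic setting the three signal singular values should sit far above the threshold, while the remaining ones should cluster at or just below it. Because the largest noise singular value fluctuates around the MP edge in finite samples, an estimate that is occasionally off by one would not be surprising.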
