Aparicio Luis, Bordyuh Mykola, Blumberg Andrew J, Rabadan Raul
Department of Systems Biology, Columbia University, New York NY 10032, USA.
Department of Biomedical Informatics, Columbia University, New York NY 10032, USA.
Patterns (N Y). 2020 May 4;1(3):100035. doi: 10.1016/j.patter.2020.100035. eCollection 2020 Jun 12.
Single-cell technologies provide the opportunity to identify new cellular states. However, a major obstacle to the identification of biological signals is noise in single-cell data. In addition, single-cell data are very sparse. We propose a new method based on random matrix theory to analyze and denoise single-cell sequencing data. The method uses the universal distributions predicted by random matrix theory for the eigenvalues and eigenvectors of random covariance/Wishart matrices to distinguish noise from signal. In addition, we explain how sparsity can cause spurious eigenvector localization, falsely identifying meaningful directions in the data. We show that roughly 95% of the information in single-cell data is compatible with the predictions of random matrix theory, about 3% is spurious signal induced by sparsity, and only the last 2% reflects true biological signal. We demonstrate the effectiveness of our approach by comparing with alternative techniques in a variety of examples with marked cell populations.
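The eigenvalue part of this idea can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes synthetic Gaussian data rather than real single-cell counts, and it uses only the Marchenko-Pastur upper edge as the noise cutoff. The function names `mp_upper_edge` and `split_signal_noise` are hypothetical. The sketch does not address eigenvector localization or sparsity, which the abstract treats as a separate source of spurious signal.

```python
import numpy as np


def mp_upper_edge(n_samples, n_features, sigma2=1.0):
    """Upper edge of the Marchenko-Pastur bulk for unit-variance noise."""
    gamma = n_features / n_samples
    return sigma2 * (1.0 + np.sqrt(gamma)) ** 2


def split_signal_noise(X):
    """Compare covariance eigenvalues of X (cells x genes) with the MP edge.

    Returns the eigenvalues in ascending order, the MP upper edge, and the
    number of eigenvalues above that edge (candidate signal directions).
    """
    n, p = X.shape
    # Standardize each gene (column) to zero mean and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    cov = Z.T @ Z / n
    eigvals = np.linalg.eigvalsh(cov)
    edge = mp_upper_edge(n, p)
    return eigvals, edge, int(np.sum(eigvals > edge))


# Synthetic example (an assumption for illustration, not real data):
# pure noise plus one shared factor acting on the first 20 genes.
rng = np.random.default_rng(0)
n_cells, n_genes = 2000, 200
X = rng.standard_normal((n_cells, n_genes))
factor = rng.standard_normal(n_cells)
X[:, :20] += factor[:, None]

eigvals, edge, n_signal = split_signal_noise(X)
print(f"MP upper edge: {edge:.3f}")
print(f"largest eigenvalue: {eigvals[-1]:.3f}")
print(f"eigenvalues above the edge: {n_signal}")
```

In finite samples the largest noise eigenvalue fluctuates around the edge, so an eigenvalue only slightly above the cutoff is not strong evidence of signal. A fuller treatment would use the fluctuation distribution at the edge and the eigenvector statistics that the abstract mentions.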