Weine Eric, Carbonetto Peter, Stephens Matthew
bioRxiv. 2024 Jul 4:2024.03.23.586420. doi: 10.1101/2024.03.23.586420.
Motivated by theoretical and practical issues that arise when applying Principal Components Analysis (PCA) to count data, Townes et al. introduced "Poisson GLM-PCA", a variation of PCA adapted to count data, as a tool for dimensionality reduction of single-cell RNA sequencing (RNA-seq) data. However, fitting GLM-PCA is computationally challenging. Here we study this problem, and show that a simple algorithm, which we call "Alternating Poisson Regression" (APR), produces better-quality fits in less time than existing algorithms. APR is also memory-efficient and lends itself to parallel implementation on multi-core processors, both of which are helpful for handling large single-cell RNA-seq data sets. We illustrate the benefits of this approach on two published single-cell RNA-seq data sets. The new algorithms are implemented in an R package, fastglmpca.
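To make the alternating idea concrete, here is a minimal sketch in pure Python. It is not the fastglmpca implementation. It assumes a rank-1 model, Y[i][j] ~ Poisson(exp(l[i] * f[j])), with no intercepts or size factors, and it uses a few backtracked Newton steps as the inner Poisson regression solver. The function names (`safe_exp`, `loglik`, `fit_coef`, `apr_rank1`) are hypothetical and chosen only for illustration.

```python
import math
import random


def safe_exp(eta):
    """exp() with the argument capped to avoid overflow (a numerical guard only)."""
    return math.exp(min(eta, 700.0))


def loglik(Y, l, f):
    """Poisson log-likelihood of the rank-1 model, dropping the constant log(y!) terms."""
    return sum(y * l[i] * f[j] - safe_exp(l[i] * f[j])
               for i, row in enumerate(Y) for j, y in enumerate(row))


def fit_coef(y, x, b, steps=5):
    """Improve scalar b in y_j ~ Poisson(exp(b * x_j)) by Newton steps with backtracking."""
    def obj(c):
        return sum(yj * c * xj - safe_exp(c * xj) for yj, xj in zip(y, x))

    for _ in range(steps):
        mu = [safe_exp(b * xj) for xj in x]
        grad = sum(xj * (yj - m) for yj, xj, m in zip(y, x, mu))
        curv = sum(xj * xj * m for xj, m in zip(x, mu))  # negative second derivative
        if curv < 1e-12:
            break
        step, t, base = grad / curv, 1.0, obj(b)
        # Halve the step until the objective does not decrease.
        while t > 1e-8 and obj(b + t * step) < base:
            t *= 0.5
        if t <= 1e-8:
            break
        b += t * step
    return b


def apr_rank1(Y, iters=20, seed=0):
    """Alternate between row-wise and column-wise Poisson regressions (rank-1 sketch)."""
    rng = random.Random(seed)
    n, m = len(Y), len(Y[0])
    l = [rng.uniform(0.1, 0.5) for _ in range(n)]
    f = [rng.uniform(0.1, 0.5) for _ in range(m)]
    history = [loglik(Y, l, f)]
    cols = list(zip(*Y))
    for _ in range(iters):
        # With f fixed, each row of Y is an independent regression for l[i].
        l = [fit_coef(Y[i], f, l[i]) for i in range(n)]
        # With l fixed, each column of Y is an independent regression for f[j].
        f = [fit_coef(cols[j], l, f[j]) for j in range(m)]
        history.append(loglik(Y, l, f))
    return l, f, history
```

Calling `apr_rank1(Y, iters=10)` on a small count matrix returns the two factor vectors and the log-likelihood after each sweep. Because every inner update is accepted only if it does not lower its own row or column objective, the recorded log-likelihood never decreases in this sketch. The regressions within one sweep do not depend on each other, which is the property that makes this kind of scheme easy to parallelize.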
The fastglmpca R package is released on CRAN for Windows, macOS and Linux, and the source code is available at github.com/stephenslab/fastglmpca under the open source GPL-3 license. Scripts to reproduce the results in this paper are also available in the GitHub repository.
Supplementary data are available online.