Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139, United States.
Department of Data Science, Dana Farber Cancer Institute, Boston, MA 02215, United States.
Bioinformatics. 2024 Aug 2;40(8). doi: 10.1093/bioinformatics/btae494.
Motivated by theoretical and practical issues that arise when applying principal component analysis (PCA) to count data, Townes et al. introduced "Poisson GLM-PCA", a variation of PCA adapted to count data, as a tool for dimensionality reduction of single-cell RNA sequencing (scRNA-seq) data. However, fitting GLM-PCA is computationally challenging. Here we study this problem and show that a simple algorithm, which we call "Alternating Poisson Regression" (APR), produces better-quality fits in less time than existing algorithms. APR is also memory-efficient and lends itself to parallel implementation on multi-core processors, both of which help when handling large scRNA-seq datasets. We illustrate the benefits of this approach on three publicly available scRNA-seq datasets. The new algorithms are implemented in an R package, fastglmpca.
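To make the alternating idea concrete, the following is a minimal Python sketch, not the fastglmpca implementation. It assumes a rank-1 model in which each count x_ij is Poisson with log-rate l_i * f_j, and it alternates between two sets of independent one-parameter Poisson regressions: one per row with f held fixed, then one per column with l held fixed. Each regression takes a single Newton step with backtracking, which is a simplification chosen for brevity. The names `row_loglik`, `update_coefs`, `loglik`, and `fit_rank1_apr` are hypothetical.

```python
import math
import random


def row_loglik(x_row, coef, covars):
    """Poisson log-likelihood of one row (up to a constant), log-rate = coef * covar."""
    try:
        return sum(x * coef * v - math.exp(coef * v) for x, v in zip(x_row, covars))
    except OverflowError:
        return float("-inf")  # treat an overflowing rate as an unacceptable step


def update_coefs(X, coefs, covars):
    """For each row of X, take one backtracking Newton step on its coefficient.

    Each row is a separate Poisson regression on `covars`, so the rows are
    independent of one another and could be processed in parallel.
    """
    updated = []
    for x_row, c in zip(X, coefs):
        mu = [math.exp(c * v) for v in covars]
        grad = sum(v * (x - m) for x, v, m in zip(x_row, covars, mu))
        curv = sum(v * v * m for v, m in zip(covars, mu))  # negative Hessian
        step = grad / curv if curv > 0 else 0.0
        old = row_loglik(x_row, c, covars)
        t = 1.0
        # Halve the step until the row log-likelihood does not decrease.
        while t > 1e-8 and row_loglik(x_row, c + t * step, covars) < old:
            t /= 2
        updated.append(c + t * step if t > 1e-8 else c)
    return updated


def loglik(X, l, f):
    """Total Poisson log-likelihood (up to a constant) of the rank-1 model."""
    return sum(row_loglik(x_row, li, f) for x_row, li in zip(X, l))


def fit_rank1_apr(X, n_iter=20, seed=0):
    """Alternate row-wise and column-wise Poisson regressions (rank-1 sketch)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    Xt = [list(col) for col in zip(*X)]  # transpose, so columns become rows
    l = [rng.uniform(0.1, 0.5) for _ in range(n)]
    f = [rng.uniform(0.1, 0.5) for _ in range(m)]
    history = [loglik(X, l, f)]
    for _ in range(n_iter):
        l = update_coefs(X, l, f)   # regress each row on f
        f = update_coefs(Xt, f, l)  # regress each column on l
        history.append(loglik(X, l, f))
    return l, f, history
```

Calling `fit_rank1_apr` on a small count matrix (a list of equal-length rows of non-negative integers) returns the two factors and the log-likelihood after each sweep. Because every step is accepted only if it does not lower the log-likelihood of its own regression, the recorded values should be non-decreasing up to rounding.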
The fastglmpca R package is released on CRAN for Windows, macOS and Linux, and the source code is available at github.com/stephenslab/fastglmpca under the open source GPL-3 license. Scripts to reproduce the results in this paper are also available in the GitHub repository and on Zenodo.