Accelerated dimensionality reduction of single-cell RNA sequencing data with fastglmpca.

Author information

Weine Eric, Carbonetto Peter, Stephens Matthew

Publication information

bioRxiv. 2024 Jul 4:2024.03.23.586420. doi: 10.1101/2024.03.23.586420.

DOI:10.1101/2024.03.23.586420

PMID:38585920

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10996495/

Abstract

SUMMARY

Motivated by theoretical and practical issues that arise when applying Principal Components Analysis (PCA) to count data, Townes et al. introduced "Poisson GLM-PCA", a variation of PCA adapted to count data, as a tool for dimensionality reduction of single-cell RNA sequencing (RNA-seq) data. However, fitting GLM-PCA is computationally challenging. Here we study this problem, and show that a simple algorithm, which we call "Alternating Poisson Regression" (APR), produces better-quality fits, and in less time, than existing algorithms. APR is also memory-efficient, and lends itself to parallel implementation on multi-core processors, both of which are helpful for handling large single-cell RNA-seq data sets. We illustrate the benefits of this approach on two published single-cell RNA-seq data sets. The new algorithms are implemented in an R package, fastglmpca.
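To make the idea of alternating Poisson regression concrete, the following is a minimal NumPy sketch, not the fastglmpca implementation. It assumes a simplified model in which the count matrix Y (genes by cells) has Poisson rates exp(L F^T), with no intercepts or size factors, and it updates each row of L, then each row of F, with a single Newton step damped by backtracking. The function names (`poisson_loglik`, `update_rows`, `fit_apr`), the `ridge` parameter, and the simulated data are all hypothetical and chosen only for illustration.

```python
import numpy as np


def poisson_loglik(Y, L, F):
    """Poisson log-likelihood (up to a constant) with log-rates H = L @ F.T."""
    H = L @ F.T
    return float(np.sum(Y * H - np.exp(H)))


def update_rows(Y, L, F, ridge=1e-8):
    """Update each row of L with one damped Newton step, holding F fixed.

    Row i solves the Poisson regression Y[i, :] ~ Poisson(exp(F @ L[i])).
    The small ridge term (an assumption of this sketch) keeps the
    Hessian invertible.
    """
    K = F.shape[1]
    L_new = L.copy()
    with np.errstate(over="ignore"):  # overflowing trial steps are rejected below
        for i in range(Y.shape[0]):
            y, b = Y[i], L[i]
            eta = F @ b
            mu = np.exp(eta)
            grad = F.T @ (y - mu)
            hess = (F * mu[:, None]).T @ F + ridge * np.eye(K)
            step = np.linalg.solve(hess, grad)
            ll_old = np.sum(y * eta - mu)
            t = 1.0
            # Backtracking: halve the step until the row log-likelihood
            # does not decrease; otherwise keep the old coefficients.
            while t > 1e-8:
                b_try = b + t * step
                eta_try = F @ b_try
                ll_try = np.sum(y * eta_try - np.exp(eta_try))
                if ll_try >= ll_old:
                    L_new[i] = b_try
                    break
                t /= 2.0
    return L_new


def fit_apr(Y, K, n_iter=20, seed=0):
    """Alternate between Poisson regressions for L (rows) and F (columns)."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    L = rng.normal(scale=0.1, size=(n, K))
    F = rng.normal(scale=0.1, size=(m, K))
    history = [poisson_loglik(Y, L, F)]
    for _ in range(n_iter):
        L = update_rows(Y, L, F)    # regress each row of Y on F
        F = update_rows(Y.T, F, L)  # regress each column of Y on L
        history.append(poisson_loglik(Y, L, F))
    return L, F, history


# Usage on simulated counts (30 "genes" by 40 "cells", rank 2).
sim = np.random.default_rng(1)
L_true = sim.normal(scale=0.5, size=(30, 2))
F_true = sim.normal(scale=0.5, size=(40, 2))
Y = sim.poisson(np.exp(L_true @ F_true.T)).astype(float)
L_hat, F_hat, history = fit_apr(Y, K=2, n_iter=10)
print("log-likelihood: start", history[0], "end", history[-1])
```

Because every row update depends only on its own row of Y and the fixed factor matrix, the inner loop is a set of independent regressions. This independence is what makes this kind of scheme a natural fit for the multi-core parallelism mentioned above. The sketch runs them serially for clarity.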

AVAILABILITY AND IMPLEMENTATION

The fastglmpca R package is released on CRAN for Windows, macOS and Linux, and the source code is available at github.com/stephenslab/fastglmpca under the open source GPL-3 license. Scripts to reproduce the results in this paper are also available in the GitHub repository.

CONTACT

mstephens@uchicago.edu.

SUPPLEMENTARY INFORMATION

Supplementary data are available online.
