Genome Biology Unit, EMBL, Heidelberg 69117, Germany.
Bioinformatics. 2021 Apr 5;36(24):5701-5702. doi: 10.1093/bioinformatics/btaa1009.
The Gamma-Poisson distribution is a theoretically and empirically motivated model for the sampling variability of single cell RNA-sequencing counts and an essential building block for analysis approaches including differential expression analysis, principal component analysis and factor analysis. Existing implementations for inferring its parameters from data often struggle with the size of single cell datasets, which can comprise millions of cells; at the same time, they do not take full advantage of the fact that zero and other small numbers are frequent in the data. These limitations have hampered uptake of the model, leaving room for statistically inferior approaches such as logarithm(-like) transformation.
We present a new R package for fitting the Gamma-Poisson distribution to data with the characteristics of modern single cell datasets more quickly and more accurately than existing methods. The software can work with data on disk without having to load them into RAM simultaneously.
The package glmGamPoi is available from Bioconductor for Windows, macOS and Linux, and source code is available on github.com/const-ae/glmGamPoi under a GPL-3 license. The scripts to reproduce the results of this paper are available on github.com/const-ae/glmGamPoi-Paper.
Supplementary data are available at Bioinformatics online.
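For readers unfamiliar with the model, the following is a minimal illustrative sketch of fitting a Gamma-Poisson distribution to a vector of counts by maximum likelihood. It is not the glmGamPoi implementation. The function names, the parameterization (mean `mu`, overdispersion `theta`, variance `mu + theta * mu**2`), the search bounds and the golden-section optimizer are all assumptions made for this example. The likelihood is evaluated once per distinct count value, a simple way to benefit from the many repeated zeros and small counts mentioned above.

```python
import math
from collections import Counter


def gamma_poisson_logpmf(y, mu, theta):
    """Log-probability of count y under a Gamma-Poisson distribution with
    mean mu > 0 and overdispersion theta > 0 (variance = mu + theta * mu**2)."""
    r = 1.0 / theta  # "size" parameter of the equivalent negative binomial
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )


def fit_gamma_poisson(counts, lo=-10.0, hi=5.0, iters=60):
    """Return (mu_hat, theta_hat) for i.i.d. counts.

    mu_hat is the sample mean; theta_hat maximizes the likelihood over
    log(theta) in [lo, hi] (hypothetical bounds) via golden-section search.
    """
    mu = sum(counts) / len(counts)
    if mu <= 0:
        raise ValueError("all counts are zero; overdispersion is not identifiable")

    # Tabulate the data: each distinct count value is evaluated only once.
    table = Counter(counts)

    def loglik(log_theta):
        theta = math.exp(log_theta)
        return sum(n * gamma_poisson_logpmf(y, mu, theta) for y, n in table.items())

    # Golden-section search for the maximum of loglik on [lo, hi].
    phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - phi * (b - a)
    d = a + phi * (b - a)
    fc, fd = loglik(c), loglik(d)
    for _ in range(iters):
        if fc > fd:
            # Maximum lies in [a, d]; reuse c as the new upper probe.
            b, d, fd = d, c, fc
            c = b - phi * (b - a)
            fc = loglik(c)
        else:
            # Maximum lies in [c, b]; reuse d as the new lower probe.
            a, c, fc = c, d, fd
            d = a + phi * (b - a)
            fd = loglik(d)
    return mu, math.exp((a + b) / 2)


# Toy example: a zero-heavy, overdispersed count vector (made-up data).
toy_counts = [0] * 60 + [1] * 20 + [2] * 10 + [5] * 5 + [12] * 5
mu_hat, theta_hat = fit_gamma_poisson(toy_counts)
```

This sketch handles a single gene with no covariates. Fitting a full regression model with a design matrix, or working with on-disk data, is beyond what it shows.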