School of Computer Science, University of Manchester, Manchester M13 9PL, UK.
Division of Informatics, Imaging and Data Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Manchester M13 9PL, UK.
Bioinformatics. 2021 Nov 5;37(21):3788-3795. doi: 10.1093/bioinformatics/btab486.
The negative binomial distribution has been shown to be a good model for counts data from both bulk and single-cell RNA-sequencing (RNA-seq). Gaussian process (GP) regression provides a useful non-parametric approach for modelling temporal or spatial changes in gene expression. However, currently available GP regression methods that implement negative binomial likelihood models do not scale to the increasingly large datasets being produced by single-cell and spatial transcriptomics.
The GPcounts package implements GP regression methods for modelling counts data using a negative binomial likelihood function. Computational efficiency is achieved through the use of variational Bayesian inference. The GP function models changes in the mean of the negative binomial likelihood through a logarithmic link function, and the dispersion parameter is fitted by maximum likelihood. We validate the method on simulated time course data, showing that it identifies changes in over-dispersed counts data better than methods based on Gaussian or Poisson likelihoods. To demonstrate temporal inference, we apply GPcounts to single-cell RNA-seq datasets after pseudotime and branching inference. To demonstrate spatial inference, we apply GPcounts to data from the mouse olfactory bulb to identify spatially variable genes and compare the results with those of two published GP methods. We also provide the option of modelling additional dropout using a zero-inflated negative binomial likelihood. Our results show that GPcounts can be used to model temporal and spatial counts data in cases where simpler Gaussian and Poisson likelihoods are unrealistic.
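As a minimal sketch of the likelihood described above (not the GPcounts API), the following Python code evaluates a negative binomial log-likelihood whose mean is tied to a latent function value through a logarithmic link, mu = exp(f). The function names `nb_log_pmf` and `log_likelihood` are hypothetical, and the mean/dispersion parameterisation with Var[y] = mu + alpha * mu^2 is an assumption, since the abstract does not state which parameterisation is used. The GP prior, the variational inference and the zero-inflated variant are all omitted.

```python
import math


def nb_log_pmf(y, mu, alpha):
    """Log-probability of a count y under a negative binomial with mean mu
    and dispersion alpha, assuming Var[y] = mu + alpha * mu**2."""
    r = 1.0 / alpha  # inverse dispersion (the "size" parameter)
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )


def log_likelihood(counts, f, alpha):
    """Sum of negative binomial log-probabilities when each latent value f_i
    sets the mean through the logarithmic link mu_i = exp(f_i)."""
    return sum(nb_log_pmf(y, math.exp(fi), alpha) for y, fi in zip(counts, f))
```

In this parameterisation, a small alpha brings the variance close to the mean (the Poisson case), while a larger alpha allows the over-dispersion typical of RNA-seq counts. A fitting procedure could in principle maximise `log_likelihood` over alpha for fixed latent values, which is one way to read "fitted by maximum likelihood".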
GPcounts is implemented using the GPflow library in Python and is available at https://github.com/ManchesterBioinference/GPcounts along with the data, code and notebooks required to reproduce the results presented here. The version used for this paper is archived at https://doi.org/10.5281/zenodo.5027066.
Supplementary data are available at Bioinformatics online.