Van Buren Scott, Sarkar Hirak, Srivastava Avi, Rashid Naim U, Patro Rob, Love Michael I
Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27516, USA.
Department of Computer Science, University of Maryland, College Park, MD 20742, USA.
Bioinformatics. 2021 Jul 19;37(12):1699-1707. doi: 10.1093/bioinformatics/btab001.
Quantification estimates of gene expression from single-cell RNA-seq (scRNA-seq) data have inherent uncertainty due to reads that map to multiple genes. Many existing scRNA-seq quantification pipelines ignore multi-mapping reads and therefore underestimate expected read counts for many genes. alevin accounts for multi-mapping reads and allows for the generation of 'inferential replicates', which reflect quantification uncertainty. Previous methods have shown improved performance when incorporating these replicates into statistical analyses, but storing and using these replicates increase computation time and memory requirements.
We demonstrate that storing only the mean and variance from a set of inferential replicates ('compression') is sufficient to capture gene-level quantification uncertainty, while reducing disk storage to as low as 9% of original storage, and memory usage when loading data to as low as 6%. Using these values, we generate 'pseudo-inferential' replicates from a negative binomial distribution and propose a general procedure for incorporating these replicates into a statistical testing framework. When applying this procedure to trajectory-based differential expression analyses, we show false positives are reduced by more than a third for genes with high levels of quantification uncertainty. We additionally extend the Swish method to incorporate pseudo-inferential replicates and demonstrate improvements in computation time and memory usage without any loss in performance. Lastly, we show that discarding multi-mapping reads can result in significant underestimation of counts for functionally important genes in a real dataset.
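The compression and regeneration steps described above can be illustrated with a minimal sketch. This is not the fishpond `makeInfReps` implementation: the function names `compress` and `pseudo_inf_reps` are hypothetical, the negative binomial parameters are obtained by simple moment matching (variance = mean + mean²/size), and the Poisson fallback for genes whose variance does not exceed the mean is an assumption made here so that the sketch is well defined.

```python
import numpy as np


def compress(inf_reps):
    """Reduce an (n_reps, n_genes) array of replicates to per-gene mean and variance."""
    inf_reps = np.asarray(inf_reps, dtype=float)
    return inf_reps.mean(axis=0), inf_reps.var(axis=0, ddof=1)


def pseudo_inf_reps(mean, var, n_reps, rng):
    """Draw pseudo-inferential replicates from a moment-matched negative binomial.

    Hypothetical helper: genes with var <= mean are drawn from a Poisson
    distribution instead (an assumption of this sketch).
    """
    mean = np.asarray(mean, dtype=float)
    var = np.asarray(var, dtype=float)
    out = np.zeros((n_reps, mean.size), dtype=np.int64)

    over = var > mean  # overdispersed genes: negative binomial is well defined
    if over.any():
        # Moment matching: var = mean + mean^2 / size
        size = mean[over] ** 2 / (var[over] - mean[over])
        prob = size / (size + mean[over])
        out[:, over] = rng.negative_binomial(
            size, prob, size=(n_reps, int(over.sum()))
        )
    if (~over).any():
        out[:, ~over] = rng.poisson(
            mean[~over], size=(n_reps, int((~over).sum()))
        )
    return out


# Usage: regenerate 20 replicates for three genes from stored summaries only.
rng = np.random.default_rng(0)
reps = pseudo_inf_reps(mean=[10.0, 5.0, 0.0], var=[40.0, 5.0, 0.0], n_reps=20, rng=rng)
print(reps.shape)
```

Only two numbers per gene and cell need to be stored in this scheme; the replicate matrix is rebuilt on demand, which is where the storage and memory savings reported above would come from.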
makeInfReps and splitSwish are implemented in the R/Bioconductor fishpond package available at https://bioconductor.org/packages/fishpond. Analyses and simulated datasets can be found in the paper's GitHub repo at https://github.com/skvanburen/scUncertaintyPaperCode.
Supplementary data are available at Bioinformatics online.