Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA.
Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA.
Biostatistics. 2018 Oct 1;19(4):562-578. doi: 10.1093/biostatistics/kxx053.
Until recently, high-throughput gene expression technologies such as RNA-Sequencing (RNA-seq) required hundreds of thousands of cells to produce reliable measurements. Recent technical advances permit genome-wide gene expression measurement at the single-cell level. Single-cell RNA-Seq (scRNA-seq) is the most widely used of these technologies, and numerous publications are based on data produced with it. However, RNA-seq and scRNA-seq data are markedly different. In particular, unlike in RNA-seq, the majority of reported expression levels in scRNA-seq are zeros. These zeros can be biologically driven (genes not expressing RNA at the time of measurement) or technically driven (genes expressing RNA, but not at a level sufficient to be detected by the sequencing technology). Another difference is that the proportion of genes with a reported expression level of zero varies substantially more across single cells than across RNA-seq samples. It remains unclear, however, to what extent this cell-to-cell variation is driven by technical rather than biological variation. Furthermore, while systematic errors, including batch effects, have been widely reported as a major challenge in high-throughput technologies, these issues have received minimal attention in published studies based on scRNA-seq technology. Here, we use an assessment experiment to examine data from published studies and demonstrate that systematic errors can explain a substantial percentage of observed cell-to-cell expression variability. Specifically, we present evidence that some of the reported zeros are driven by technical variation, by demonstrating that scRNA-seq produces more zeros than expected and that this bias is greater for lower expressed genes. This missing data problem is exacerbated by the fact that the technical variation itself varies from cell to cell. We then show how this technical cell-to-cell variability can be confused with novel biological results. Finally, we demonstrate and discuss how batch effects and confounded experiments can intensify the problem.
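To make the central quantity concrete, the following is a minimal sketch of how the per-cell proportion of zeros can be computed, and of how purely technical differences in detection can make that proportion vary between cells. It is not the authors' analysis: the function names, the toy simulation, and the capture rates are all assumptions chosen for illustration, and the simulated counts are not real data.

```python
import random


def zero_fraction_per_cell(counts):
    """Return, for each cell, the fraction of genes with a zero count.

    counts is a list of rows, one row per gene, each row holding one
    count per cell (genes x cells).
    """
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [
        sum(1 for g in range(n_genes) if counts[g][c] == 0) / n_genes
        for c in range(n_cells)
    ]


def simulate_counts(n_genes, capture_rates, seed=0):
    """Toy model (an assumption, not the paper's model).

    Every cell holds the same true number of molecules per gene, and each
    molecule is detected independently with that cell's capture rate, so
    any difference between cells is technical by construction.
    """
    rng = random.Random(seed)
    true_molecules = [rng.randint(1, 20) for _ in range(n_genes)]
    return [
        [sum(1 for _ in range(m) if rng.random() < p) for p in capture_rates]
        for m in true_molecules
    ]


# Three hypothetical cells that differ only in capture efficiency.
capture_rates = [0.05, 0.2, 0.6]
counts = simulate_counts(2000, capture_rates)
fractions = zero_fraction_per_cell(counts)

for rate, frac in zip(capture_rates, fractions):
    print(f"capture rate {rate:.2f}: fraction of zeros {frac:.3f}")
```

In this toy setting the cells are biologically identical, yet the cell with the lowest capture rate reports the most zeros, which is the kind of technical cell-to-cell variation the abstract warns can be mistaken for biology.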