Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah 84108, USA.
Seoul National University, College of Veterinary Medicine, Seoul, 08826, South Korea.
Genome Res. 2024 Mar 20;34(2):179-188. doi: 10.1101/gr.278253.123.
A mechanistic understanding of the biological and technical factors that impact transcript measurements is essential to designing and analyzing single-cell and single-nucleus RNA sequencing experiments. Nuclei contain the same pre-mRNA population as cells, but they contain a small subset of the mRNAs. Nonetheless, early studies argued that single-nucleus analysis yielded results comparable to cellular samples if pre-mRNA measurements were included. However, typical workflows do not distinguish between pre-mRNA and mRNA when estimating gene expression, and variation in their relative abundances across cell types has received limited attention. These gaps are especially important given that incorporating pre-mRNA has become commonplace for both assays, despite known gene length bias in pre-mRNA capture. Here, we reanalyze public data sets from mouse and human to describe the mechanisms and contrasting effects of mRNA and pre-mRNA sampling on gene expression and marker gene selection in single-cell and single-nucleus RNA-seq. We show that pre-mRNA levels vary considerably among cell types, which mediates the degree of gene length bias and limits the generalizability of a recently published normalization method intended to correct for this bias. As an alternative, we repurpose an existing post hoc gene length-based correction method from conventional RNA-seq gene set enrichment analysis. Finally, we show that inclusion of pre-mRNA in bioinformatic processing can impart a larger effect than assay choice itself, which is pivotal to the effective reuse of existing data. These analyses advance our understanding of the sources of variation in single-cell and single-nucleus RNA-seq experiments and provide useful guidance for future studies.
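The claim that pre-mRNA levels vary among cell types can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes that intronic UMI counts serve as a proxy for pre-mRNA and exonic counts as a proxy for mature mRNA, and that each cell already carries a cell-type label. The function name, input layout, and toy numbers are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def pre_mrna_fraction_by_cell_type(cells):
    """Return the pooled intronic fraction for each cell type.

    `cells` is an iterable of (cell_type, exonic_umis, intronic_umis)
    tuples. Assumption: intronic UMIs approximate pre-mRNA and exonic
    UMIs approximate mature mRNA.
    """
    totals = defaultdict(lambda: [0, 0])  # cell_type -> [exonic, intronic]
    for cell_type, exonic, intronic in cells:
        totals[cell_type][0] += exonic
        totals[cell_type][1] += intronic

    fractions = {}
    for cell_type, (exonic, intronic) in totals.items():
        total = exonic + intronic
        # Guard against cell types with no counted UMIs.
        fractions[cell_type] = intronic / total if total else 0.0
    return fractions


# Toy counts, invented for illustration only (not data from the study).
toy_cells = [
    ("neuron", 40, 60),
    ("neuron", 60, 40),
    ("microglia", 90, 10),
]
fractions = pre_mrna_fraction_by_cell_type(toy_cells)
for cell_type, fraction in sorted(fractions.items()):
    print(f"{cell_type}: intronic fraction = {fraction:.2f}")
```

Under these assumptions, a large spread in the pooled intronic fraction across cell types would suggest that any gene length bias tied to pre-mRNA capture is also unevenly distributed, which is the situation the abstract describes.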