Chen Wenan, Zhang Silu, Williams Justin, Ju Bensheng, Shaner Bridget, Easton John, Wu Gang, Chen Xiang
Center for Applied Bioinformatics, St. Jude Children's Research Hospital, Memphis, TN, United States.
Department of Diagnostic Imaging, St. Jude Children's Research Hospital, Memphis, TN, United States.
Comput Struct Biotechnol J. 2020 Mar 30;18:861-873. doi: 10.1016/j.csbj.2020.03.026. eCollection 2020.
Accounting for batch effects, especially latent batch effects, in differential expression (DE) analysis is critical for identifying true biological effects. Single-cell RNA sequencing (scRNA-seq) is a powerful tool for quantifying cell-to-cell variation in transcript abundance and characterizing cellular dynamics. Although many scRNA-seq DE analysis methods accommodate known batch variables, their performance has not been systematically evaluated. Moreover, the challenge of accounting for latent batch variables in scRNA-seq DE analysis is largely unmet. In contrast, many methods have been developed to account for batch variables (either known or latent) in other high-dimensional data, especially bulk RNA-seq. We extensively evaluate 11 methods for handling batch variables in different scRNA-seq DE analysis scenarios, with a primary focus on latent batch variables. We demonstrate that for known batch variables, incorporating them as covariates into a regression model outperforms approaches that use a batch-corrected matrix. For latent batches, fixed effects models have inflated false discovery rates (FDRs), whereas aggregation-based methods and mixed effects models suffer significant power loss. Surrogate variable-based methods generally control the FDR well while achieving good power when group effects are small. However, their performance (except that of surrogate variable analysis, SVA) deteriorates substantially in scenarios involving large group effects and/or group label impurity. In these settings, SVA achieves relatively good performance despite an occasionally inflated FDR (up to 0.2). Finally, we make the following recommendations for scRNA-seq DE analysis: 1) incorporate known batch variables as covariates instead of using batch-corrected data; and 2) employ SVA for latent batch correction. However, better methods are still needed to fully unleash the power of scRNA-seq.
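The first recommendation, including a known batch variable as a covariate in the regression model, can be illustrated with a minimal sketch. This is not the benchmark code from the study: it assumes a single gene with continuous, log-scale-like expression values, an ordinary least squares fit, and a simulated data set in which batch is unevenly distributed across groups. The function names `fit_group_effect` and `simulate` and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_group_effect(y, group, batch=None):
    """Estimate the group coefficient for one gene by ordinary least squares.

    If `batch` is given, indicator columns for every batch level except the
    first are added to the design matrix as covariates.
    """
    y = np.asarray(y, dtype=float)
    columns = [np.ones(len(y)), np.asarray(group, dtype=float)]
    if batch is not None:
        batch = np.asarray(batch)
        for level in np.unique(batch)[1:]:
            columns.append((batch == level).astype(float))
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]  # coefficient of the group indicator


def simulate(n_cells=400, group_effect=0.5, batch_effect=2.0, seed=0):
    """Simulate one gene (illustrative assumption, not the study design).

    Group 0 cells fall mostly in batch 0 and group 1 cells mostly in batch 1,
    so ignoring batch biases the estimated group effect.
    """
    rng = np.random.default_rng(seed)
    group = np.repeat([0, 1], n_cells // 2)
    p_batch1 = np.where(group == 1, 0.8, 0.2)
    batch = (rng.random(n_cells) < p_batch1).astype(int)
    noise = rng.normal(0.0, 1.0, n_cells)
    y = group_effect * group + batch_effect * batch + noise
    return y, group, batch


if __name__ == "__main__":
    y, group, batch = simulate()
    print("group effect, batch ignored:    ", fit_group_effect(y, group))
    print("group effect, batch as covariate:", fit_group_effect(y, group, batch))
```

In this toy setting the estimate that ignores batch absorbs part of the batch effect, while the model with the batch covariate recovers a value close to the simulated group effect. Real scRNA-seq DE methods use count-based or zero-inflated models rather than plain least squares, so this only conveys the idea of covariate adjustment.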