Department of Biostatistics, University of Washington, Seattle, WA, USA.
School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, People's Republic of China.
BMC Bioinformatics. 2024 Mar 15;25(1):113. doi: 10.1186/s12859-024-05724-7.
Single-cell RNA-sequencing (scRNA-seq) datasets are increasingly common in clinical and cohort studies, but methods for identifying differentially expressed (DE) genes across datasets with many individuals are lacking. Many methods find DE genes in scRNA-seq data from a few individuals, but DE testing for large cohorts of case and control individuals poses distinct challenges: human variation, in the form of individual-level confounding covariates, has substantial effects that are difficult to account for when genes are sparsely observed.
We develop eSVD-DE, a matrix factorization that pools information across genes and removes confounding covariate effects, followed by a novel two-sample test of mean expression between case and control individuals. In general, differential testing after dimension reduction inflates Type-1 errors. We overcome this by using a hierarchical model to test for differences between the posterior mean distributions of the case and control individuals. On previously published datasets from various biological systems, eSVD-DE is more accurate and more powerful than other DE methods that are typically repurposed for cohort-wide differential expression analysis.
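As a rough illustration of the three stages described above (factorize, form posterior means, test at the individual level), the following is a minimal sketch and not the eSVD-DE implementation. It makes several assumptions that are not stated in the abstract: a truncated SVD of log-transformed counts stands in for the matrix factorization, covariate adjustment is omitted, the hierarchical model is taken to be a Gamma prior with a Poisson likelihood, and a Welch-type statistic stands in for the two-sample test. All function names, the `strength` parameter, and the simulated data are hypothetical.

```python
import numpy as np

def low_rank_means(counts, rank):
    """Stand-in for the factorization (assumption): truncated SVD of
    log1p counts, mapped back to the count scale."""
    logged = np.log1p(counts)
    u, s, vt = np.linalg.svd(logged, full_matrices=False)
    approx = (u[:, :rank] * s[:rank]) @ vt[:rank]
    return np.clip(np.expm1(approx), 1e-6, None)

def posterior_mean(counts, fitted, strength):
    """Assumed hierarchy: Gamma prior with mean `fitted` and rate
    `strength`, Poisson likelihood. The posterior is
    Gamma(fitted * strength + count, strength + 1); return its mean."""
    return (fitted * strength + counts) / (strength + 1.0)

def individual_level_t(post, individual, case_flags):
    """Average posterior means within each individual, then compute a
    Welch-type statistic per gene between case and control individuals."""
    ids = np.unique(individual)
    means = np.vstack([post[individual == i].mean(axis=0) for i in ids])
    case = means[case_flags[ids]]
    ctrl = means[~case_flags[ids]]
    se = np.sqrt(case.var(axis=0, ddof=1) / len(case)
                 + ctrl.var(axis=0, ddof=1) / len(ctrl))
    return (case.mean(axis=0) - ctrl.mean(axis=0)) / np.maximum(se, 1e-12)

# Simulated toy cohort (hypothetical): 10 individuals, 30 cells each, 20 genes.
rng = np.random.default_rng(0)
n_ind, cells_per, n_genes = 10, 30, 20
individual = np.repeat(np.arange(n_ind), cells_per)
case_flags = np.arange(n_ind) < 5          # first 5 individuals are cases
base = rng.gamma(2.0, 1.0, size=n_genes)
rate = np.tile(base, (n_ind * cells_per, 1))
rate[case_flags[individual], 0] *= 3.0     # gene 0 is elevated in cases
counts = rng.poisson(rate).astype(float)

fitted = low_rank_means(counts, rank=3)
post = posterior_mean(counts, fitted, strength=5.0)
t_stats = individual_level_t(post, individual, case_flags)
```

The point of the sketch is the unit of analysis: the statistic is computed across individuals rather than across cells, so its degrees of freedom reflect the number of people in the cohort and not the number of cells sequenced.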
eSVD-DE provides a novel and powerful way to test for DE genes among cohorts after dimension reduction. Accurately identifying differential expression at the individual level, rather than the cell level, is important for linking scRNA-seq studies to our understanding of the human population.