Department of Biostatistics, Yale School of Public Health, New Haven, CT, 06520, USA.
Section of Pulmonary, Critical Care and Sleep Medicine, Yale School of Medicine, New Haven, CT, 06520, USA.
BMC Bioinformatics. 2023 Aug 22;24(1):318. doi: 10.1186/s12859-023-05432-8.
Single-cell RNA sequencing (scRNA-seq) technology has enabled assessment of transcriptome-wide changes at single-cell resolution. Because environmental exposure and genetic background vary across subjects, subject effect is a major source of variation in scRNA-seq data from multiple subjects and severely confounds cell type-specific differential expression (DE) analysis. Moreover, dropout events are prevalent in scRNA-seq data and produce an excessive number of zeros, which further complicates DE analysis.
We developed iDESC to detect cell type-specific DE genes between two groups of subjects in scRNA-seq data. iDESC uses a zero-inflated negative binomial mixed model to account for both subject effect and dropouts. The dropout rate (the prevalence of dropout events) was shown to depend on gene expression level, and this dependence is modeled by pooling information across genes. Subject effect is modeled as a random effect in the log-mean of the negative binomial component. We compared the performance of iDESC with eleven existing DE analysis methods. Using simulated data, we demonstrated that iDESC had well-controlled type I error and higher power than the existing methods. When the methods with well-controlled type I error were applied to three real scRNA-seq datasets from the same tissue and disease, iDESC achieved the best consistency between datasets and the best disease relevance.
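To make the model structure concrete, the following is a minimal simulation sketch of a zero-inflated negative binomial model with a subject random effect, written for a single gene. It is not the iDESC implementation. The function name, the parameter names, the logistic link between dropout rate and expression level, and the omission of a sequencing-depth offset are all assumptions made for illustration; the abstract states only that the dropout rate depends on expression level and that subject effect enters the log-mean as a random effect.

```python
import numpy as np


def simulate_zinb_counts(group, subject, beta0, beta1, sigma_subject,
                         dispersion, dropout_intercept, dropout_slope, rng):
    """Simulate counts for one gene from a ZINB model with a subject random effect.

    All names and the logistic dropout link are illustrative assumptions,
    not the parametrization used by iDESC.
    """
    group = np.asarray(group, dtype=float)    # 0/1 group label per cell
    subject = np.asarray(subject, dtype=int)  # subject index per cell, starting at 0

    # One random intercept per subject, shared by all cells of that subject.
    n_subjects = int(subject.max()) + 1
    subject_effect = rng.normal(0.0, sigma_subject, size=n_subjects)

    # Log-mean of the negative binomial component:
    # baseline + group (disease) effect + subject random effect.
    mu = np.exp(beta0 + beta1 * group + subject_effect[subject])

    # NumPy parametrizes the negative binomial by (n, p); n = 1/dispersion and
    # p = n / (n + mu) give mean mu and variance mu + dispersion * mu**2.
    size = 1.0 / dispersion
    counts = rng.negative_binomial(size, size / (size + mu))

    # Assumed link: dropout probability is a logistic function of the gene's
    # baseline log expression (a negative slope means fewer dropouts at high expression).
    dropout_prob = 1.0 / (1.0 + np.exp(-(dropout_intercept + dropout_slope * beta0)))
    is_dropout = rng.random(counts.shape) < dropout_prob

    # Dropout events overwrite the negative binomial draw with a zero.
    return np.where(is_dropout, 0, counts)


# Example: 6 subjects (3 per group), 200 cells each, with hypothetical parameter values.
rng = np.random.default_rng(0)
subject = np.repeat(np.arange(6), 200)
group = (subject >= 3).astype(int)
counts = simulate_zinb_counts(group, subject, beta0=1.0, beta1=0.5,
                              sigma_subject=0.3, dispersion=0.5,
                              dropout_intercept=0.0, dropout_slope=-1.0, rng=rng)
```

Because every cell from a subject shares the same random intercept, cells within a subject are correlated, which is why treating cells as independent replicates can confound the group effect with subject-level variation.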
By accounting for dropouts and separating subject effect from disease effect, iDESC achieved more accurate and robust DE analysis results, which suggests that considering subject effect and dropouts is important in the DE analysis of scRNA-seq data with multiple subjects.