An evaluation of statistical differential analysis methods in single-cell RNA-seq data.

Authors

Li Dongmei, Zand Martin, Dye Timothy, Goniewicz Maciej, Rahman Irfan, Xie Zidian

Affiliations

Clinical and Translational Science Institute, School of Medicine and Dentistry, University of Rochester, 265 Crittenden Boulevard CU 420708, 14642 Rochester, NY, US.

Department of Medicine, University of Rochester Medical Center, 601 Elmwood Ave, Box 675, 14642 Rochester, NY, US.

Publication

Res Sq. 2023 Mar 23:rs.3.rs-2670717. doi: 10.21203/rs.3.rs-2670717/v1.

DOI:10.21203/rs.3.rs-2670717/v1

PMID:36993457

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10055642/

Abstract

BACKGROUND

Single-cell RNA sequencing has gained popularity in recent years. Compared to bulk RNA-Seq, single-cell RNA sequencing allows gene expression to be measured within individual cells rather than as mean expression levels across all cells in a sample. Thus, cell-to-cell variation in gene expression can be examined. Gene differential expression analysis remains the major purpose of most single-cell RNA sequencing experiments, and many methods have been developed in recent years to conduct it on single-cell RNA sequencing data.

RESULTS

Through simulation studies and real data examples, we evaluated the performance of five popular open-source methods for gene differential expression analysis of single-cell RNA sequencing data: DEsingle (a zero-inflated negative binomial model), Linnorm (an empirical Bayes method applied to transformed count data using the limma package), monocle (an approximate chi-square likelihood ratio test), MAST (a generalized linear hurdle model), and DESeq2 (a generalized linear model with an empirical Bayes approach, also commonly used for bulk RNA sequencing differential expression analyses). We assessed false discovery rate (FDR) control, sensitivity, specificity, accuracy, and the area under the receiver operating characteristic curve (AUROC) for all five methods under different sample sizes, distribution assumptions, and proportions of zeros in the data.
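As an illustration of the evaluation criteria named above, the following minimal sketch computes them for one method on a toy simulation where the truly differentially expressed genes are known. It is not the authors' evaluation code: the function names `evaluate` and `auroc`, the cutoff `alpha=0.05`, and the toy values are assumptions made for this example. The `fdr` entry is the observed false discovery proportion for a single data set; FDR control would be judged by averaging it over many simulated data sets.

```python
def evaluate(p_values, is_de, alpha=0.05):
    """Confusion-matrix metrics when genes with p <= alpha are called DE.

    p_values: one (possibly multiplicity-adjusted) p-value per gene.
    is_de:    simulation truth, True for genes that are truly DE.
    alpha:    hypothetical significance cutoff chosen for this sketch.
    """
    called = [p <= alpha for p in p_values]
    tp = sum(c and t for c, t in zip(called, is_de))
    fp = sum(c and not t for c, t in zip(called, is_de))
    fn = sum(not c and t for c, t in zip(called, is_de))
    tn = sum(not c and not t for c, t in zip(called, is_de))
    return {
        "fdr": fp / (tp + fp) if tp + fp else 0.0,  # false discovery proportion
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / len(is_de),
    }


def auroc(p_values, is_de):
    """AUROC as the probability that a randomly chosen DE gene has a
    smaller p-value than a randomly chosen non-DE gene (ties count 0.5)."""
    pos = [p for p, t in zip(p_values, is_de) if t]
    neg = [p for p, t in zip(p_values, is_de) if not t]
    wins = sum(1.0 if a < b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 8 genes, the first 3 are truly DE (made-up values).
p_values = [0.001, 0.02, 0.30, 0.04, 0.60, 0.01, 0.80, 0.50]
is_de = [True, True, True, False, False, False, False, False]

metrics = evaluate(p_values, is_de)
auc = auroc(p_values, is_de)
print(metrics, auc)
```

The pairwise AUROC computation is quadratic in the number of genes, which is fine for a sketch; a rank-based formula would be preferable for tens of thousands of genes.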

CONCLUSIONS

We found that, when the data followed negative binomial distributions, MAST performed best among the five methods, with the largest AUROC values across all tested sample sizes and proportions of truly differentially expressed genes. When the sample size increased to 100 in each group, MAST showed the best performance, with the highest AUROC regardless of the data distribution. If excess zeros were filtered out before the differential expression analyses, DEsingle, Linnorm, and DESeq2 performed relatively better than MAST and monocle, with higher AUROC values.
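The abstract does not state how excess zeros were filtered. One common pre-processing convention, shown here purely as an assumption, is to drop genes whose counts are zero in too large a fraction of cells before running the differential expression test. The function names and the `max_zero_fraction=0.8` threshold below are hypothetical and are not taken from the paper.

```python
def zero_fraction(row):
    """Fraction of cells in which a gene has a zero count."""
    return sum(1 for x in row if x == 0) / len(row)


def filter_zero_heavy_genes(counts, max_zero_fraction=0.8):
    """Keep genes whose zero fraction is at most max_zero_fraction.

    counts: dict mapping gene name -> list of counts, one per cell.
    max_zero_fraction: hypothetical threshold chosen for this sketch.
    """
    return {gene: row for gene, row in counts.items()
            if zero_fraction(row) <= max_zero_fraction}


# Toy count matrix with 10 cells per gene (made-up values).
counts = {
    "geneA": [0, 0, 0, 0, 0, 0, 0, 0, 0, 3],  # 90% zeros, dropped
    "geneB": [5, 0, 2, 0, 7, 1, 0, 4, 0, 6],  # 40% zeros, kept
    "geneC": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # all zeros, dropped
}

kept = filter_zero_heavy_genes(counts)
print(sorted(kept))
```

The filtered dictionary would then be passed to whichever differential expression method is being compared, so that all methods see the same reduced gene set.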
