Computer Science and Engineering Department, University of Connecticut, Storrs, CT, USA.
Department of Molecular & Cell Biology, University of Connecticut, Storrs, CT, USA.
BMC Bioinformatics. 2019 Jan 18;20(1):40. doi: 10.1186/s12859-019-2599-6.
The analysis of single-cell RNA sequencing (scRNAseq) data plays an important role in understanding the intrinsic and extrinsic cellular processes in biological and biomedical research. One significant effort in this area is the detection of differentially expressed (DE) genes. scRNAseq data, however, are highly heterogeneous and have a large number of zero counts, which introduces challenges in detecting DE genes. Addressing these challenges requires employing new approaches beyond the conventional ones, which are based on a nonzero difference in average expression. Several methods have been developed for differential gene expression analysis of scRNAseq data. To provide guidance on choosing an appropriate tool or developing a new one, it is necessary to evaluate and compare the performance of differential gene expression analysis methods for scRNAseq data.
In this study, we conducted a comprehensive evaluation of the performance of eleven differential gene expression analysis software tools, which are designed for scRNAseq data or can be applied to them. We used simulated and real data to evaluate the accuracy and precision of detection. Using simulated data, we investigated the effect of sample size on the detection accuracy of the tools. Using real data, we examined the agreement among the tools in identifying DE genes, the run time of the tools, and the biological relevance of the detected DE genes.
In general, agreement among the tools in calling DE genes is not high. There is a trade-off between the true-positive rate and the precision of calling DE genes. Methods with higher true-positive rates tend to show low precision because they introduce false positives, whereas methods with high precision show low true-positive rates because they identify few DE genes. We observed that current methods designed for scRNAseq data do not tend to perform better than methods designed for bulk RNAseq data. Data multimodality and an abundance of zero read counts are the main characteristics of scRNAseq data; they play important roles in the performance of differential gene expression analysis methods and need to be considered when developing new methods.
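The trade-off between true-positive rate and precision can be illustrated with a minimal sketch. The gene names, the two hypothetical tools ("liberal" and "conservative"), and the function `tpr_and_precision` are assumptions made for illustration only; they are not data, tools, or code from the study.

```python
def tpr_and_precision(called, truth):
    """Return (true-positive rate, precision) for one tool's DE calls.

    called: genes the tool reports as differentially expressed
    truth:  genes that are truly differentially expressed
    """
    called, truth = set(called), set(truth)
    tp = len(called & truth)  # true positives: called and truly DE
    tpr = tp / len(truth) if truth else 0.0
    precision = tp / len(called) if called else 0.0
    return tpr, precision


# Hypothetical ground truth, as would be known for simulated data.
truth = {"g1", "g2", "g3", "g4"}

# A hypothetical tool that calls many genes: finds every true DE gene,
# but half of its calls are false positives.
liberal = {"g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"}

# A hypothetical tool that calls few genes: every call is correct,
# but most true DE genes are missed.
conservative = {"g1"}

liberal_scores = tpr_and_precision(liberal, truth)
conservative_scores = tpr_and_precision(conservative, truth)

print("liberal      (TPR, precision):", liberal_scores)
print("conservative (TPR, precision):", conservative_scores)
```

In this toy case the liberal tool reaches a true-positive rate of 1.0 at a precision of 0.5, while the conservative tool reaches a precision of 1.0 at a true-positive rate of 0.25, mirroring the pattern described above.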