Bioinformatics and Cellular Genomics, St Vincent's Institute of Medical Research, 9 Princes St, Fitzroy, 3065, VIC, Australia.
School of Mathematics and Statistics, University of Melbourne, Parkville, 3010, VIC, Australia.
Genome Biol. 2024 Feb 26;25(1):56. doi: 10.1186/s13059-024-03183-0.
The development of single-cell RNA sequencing (scRNA-seq) has enabled scientists to catalog and probe the transcriptional heterogeneity of individual cells in unprecedented detail. A common step in the analysis of scRNA-seq data is the selection of so-called marker genes, most commonly to enable annotation of the biological cell types present in the sample. In this paper, we benchmark 59 computational methods for selecting marker genes in scRNA-seq data.
We compare the performance of the methods using 14 real scRNA-seq datasets and over 170 additional simulated datasets. Methods are compared on their ability to recover simulated and expert-annotated marker genes, the predictive performance and characteristics of the gene sets they select, their memory usage and speed, and their implementation quality. In addition, various case studies are used to scrutinize the most commonly used methods, highlighting issues and inconsistencies.
Overall, we present a comprehensive evaluation of methods for selecting marker genes in scRNA-seq data. Our results highlight the efficacy of simple methods, especially the Wilcoxon rank-sum test, Student's t-test, and logistic regression.
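To make concrete what a "simple method" of this kind involves, the following is a minimal sketch of marker gene ranking with a Wilcoxon rank-sum test. It is not the benchmark's code or any particular package's implementation. It assumes a one-vs-rest setup (cells in one cluster against all other cells), uses the tie-corrected normal approximation without a continuity correction, and applies no multiple-testing adjustment. The function names (`midranks`, `rank_sum_test`, `rank_markers`) and the toy expression values are hypothetical and chosen only for illustration.

```python
import math


def midranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, tie-corrected).

    Returns (auc, p_value), where auc = U / (len(x) * len(y)) estimates the
    probability that a value from x exceeds a value from y.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = list(x) + list(y)
    ranks = midranks(combined)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2

    # Tie correction: sum of (t^3 - t) over groups of tied values.
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(c ** 3 - c for c in counts.values())
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:  # all values identical: no evidence of a difference
        return 0.5, 1.0

    z = (u - n1 * n2 / 2) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u / (n1 * n2), p


def rank_markers(expr, labels, cluster):
    """Rank genes up-regulated in `cluster` versus all other cells.

    expr maps gene name -> list of expression values, one per cell;
    labels gives the cluster label of each cell in the same order.
    """
    in_cluster = [lab == cluster for lab in labels]
    results = []
    for gene, values in expr.items():
        x = [v for v, m in zip(values, in_cluster) if m]
        y = [v for v, m in zip(values, in_cluster) if not m]
        auc, p = rank_sum_test(x, y)
        if auc > 0.5:  # keep only genes higher inside the cluster
            results.append((gene, auc, p))
    return sorted(results, key=lambda r: r[2])  # smallest p-value first


# Toy data (hypothetical): 4 cells in cluster A, 4 in cluster B.
labels = ["A"] * 4 + ["B"] * 4
expr = {
    "gene1": [5, 6, 7, 8, 0, 1, 0, 2],  # higher in A
    "gene2": [0, 1, 0, 1, 6, 5, 7, 9],  # higher in B
    "gene3": [3, 1, 4, 2, 2, 4, 1, 3],  # same distribution in both
}
markers_a = rank_markers(expr, labels, "A")
markers_b = rank_markers(expr, labels, "B")
```

On this toy input only `gene1` is returned for cluster A and only `gene2` for cluster B; `gene3` is dropped because its values are distributed identically in the two groups. In practice the p-values would also need correcting for the number of genes tested.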