School of Mathematics and Statistics, University of Sydney, Sydney, NSW, 2006, Australia.
Computational Systems Biology Group, Children's Medical Research Institute, University of Sydney, Westmead, NSW, 2145, Australia.
Genome Biol. 2022 Feb 8;23(1):49. doi: 10.1186/s13059-022-02622-0.
A key task in single-cell RNA-seq (scRNA-seq) data analysis is to accurately detect the number of cell types in the sample, which can be critical for downstream analyses such as cell type identification. Various scRNA-seq clustering algorithms have been designed to estimate the number of cell types automatically by optimising the number of clusters in a dataset. The lack of benchmark studies, however, complicates the choice of method.
We systematically benchmark a range of popular clustering algorithms on estimating the number of cell types in a variety of settings. We sample from the Tabula Muris data to create scRNA-seq datasets that vary in the number of cell types, the number of cells in each cell type, and the cell type proportions. The large number of datasets enables us to assess the algorithms, which cover four broad categories of approaches, against a panel of criteria. We further cross-compare performance on datasets with high cell numbers using Tabula Muris and Tabula Sapiens data.
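The sampling design described above can be sketched as follows. This is only an illustrative outline, not the authors' pipeline: the function name `sample_dataset`, its parameters, and the rounding rule for per-type cell counts are all assumptions, and the input is taken to be a plain list of cell type labels, one per cell.

```python
import random
from collections import Counter


def sample_dataset(labels, n_types, n_cells, proportions=None, seed=0):
    """Return indices of a subsample containing n_types cell types.

    Hypothetical helper: labels holds one cell type label per cell,
    proportions gives the target share of each chosen type
    (balanced if omitted).
    """
    rng = random.Random(seed)

    # Group cell indices by their annotated cell type.
    by_type = {}
    for idx, lab in enumerate(labels):
        by_type.setdefault(lab, []).append(idx)

    # Pick which cell types appear in this simulated dataset.
    chosen_types = rng.sample(sorted(by_type), n_types)
    if proportions is None:
        proportions = [1.0 / n_types] * n_types

    # Draw cells per type without replacement, capped at availability.
    picked = []
    for cell_type, prop in zip(chosen_types, proportions):
        k = min(round(prop * n_cells), len(by_type[cell_type]))
        picked.extend(rng.sample(by_type[cell_type], k))
    return picked


# Usage on toy labels: two cell types, 40 cells, a 3:1 imbalance.
toy_labels = ["A"] * 50 + ["B"] * 50 + ["C"] * 50
subset = sample_dataset(toy_labels, n_types=2, n_cells=40, proportions=[0.75, 0.25])
print(Counter(toy_labels[i] for i in subset))
```

Varying `n_types`, `n_cells`, and `proportions` over a grid, with several seeds per setting, would give the kind of dataset collection the benchmark relies on.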
We identify the strengths and weaknesses of each method on multiple criteria, including the deviation of the estimate from the true number of cell types, the variability of the estimate, the clustering concordance of cells with their predefined cell types, and running time and peak memory usage. We then summarise these results into a multi-aspect recommendation for users. The proposed stability-based approach for estimating the number of cell types is implemented in an R package and is freely available at https://github.com/PYangLab/scCCESS.
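As a rough illustration of the first three criteria, the sketch below computes them for a set of repeated runs. The abstract does not give the exact definitions, so the choices here are assumptions: deviation is taken as the mean signed difference between estimated and true numbers of cell types, variability as the population standard deviation of the estimates, and concordance as the adjusted Rand index (ARI). All function names are hypothetical.

```python
from collections import Counter
from math import comb
from statistics import mean, pstdev


def deviation(estimates, true_k):
    """Mean signed error; positive means over-estimation (assumed definition)."""
    return mean(k - true_k for k in estimates)


def variability(estimates):
    """Spread of the estimates across repeated datasets (assumed definition)."""
    return pstdev(estimates)


def adjusted_rand_index(truth, pred):
    """ARI between predefined cell types and cluster assignments."""
    n = len(truth)
    if n < 2:
        return 1.0
    # Pair counts within contingency cells, rows, and columns.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(truth).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


# Usage: three runs of one method on datasets with five true cell types.
estimates = [4, 5, 6]
print(deviation(estimates, true_k=5), variability(estimates))
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

Running time and peak memory are measured around the clustering call itself, so they are left out of this sketch.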