Benchmarking clustering algorithms on estimating the number of cell types from single-cell RNA-sequencing data.

Affiliations

School of Mathematics and Statistics, University of Sydney, Sydney, NSW, 2006, Australia.

Computational Systems Biology Group, Children's Medical Research Institute, University of Sydney, Westmead, NSW, 2145, Australia.

Publication information

Genome Biol. 2022 Feb 8;23(1):49. doi: 10.1186/s13059-022-02622-0.

DOI: 10.1186/s13059-022-02622-0

PMID: 35135612

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8822786/

Abstract

BACKGROUND

A key task in single-cell RNA-seq (scRNA-seq) data analysis is to accurately detect the number of cell types in the sample, which can be critical for downstream analyses such as cell type identification. Various scRNA-seq data clustering algorithms have been specifically designed to estimate the number of cell types automatically by optimising the number of clusters in a dataset. The lack of benchmark studies, however, complicates the choice of method.

RESULTS

We systematically benchmark a range of popular clustering algorithms on estimating the number of cell types in a variety of settings. To do so, we sample from the Tabula Muris data to create scRNA-seq datasets with a varying number of cell types, a varying number of cells in each cell type, and different cell type proportions. The large number of datasets enables us to assess the algorithms, which cover four broad categories of approaches, from multiple aspects using a panel of criteria. We further cross-compare performance on datasets with high cell numbers using Tabula Muris and Tabula Sapiens data.
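The sampling design described above can be illustrated with a minimal sketch. It assumes only a list of cell-type labels for a reference atlas; the function name `sample_dataset` and its parameters are hypothetical and are not taken from the study's code. It draws a subset of cells with a chosen number of cell types, total cell count, and cell type proportions.

```python
import random
from collections import defaultdict


def sample_dataset(labels, n_types, n_cells, proportions=None, seed=0):
    """Return sorted indices of cells sampled from a labelled reference.

    labels      -- one cell-type label per reference cell
    n_types     -- number of cell types to include in the sampled dataset
    n_cells     -- total number of cells to draw
    proportions -- optional weights (one per type, summing to 1);
                   a balanced design is used when omitted
    """
    rng = random.Random(seed)

    # Group reference cell indices by their annotated cell type.
    by_type = defaultdict(list)
    for idx, label in enumerate(labels):
        by_type[label].append(idx)

    # Choose which cell types appear in this dataset.
    chosen_types = rng.sample(sorted(by_type), n_types)

    if proportions is None:
        proportions = [1.0 / n_types] * n_types
    if len(proportions) != n_types:
        raise ValueError("need exactly one proportion per cell type")

    picked = []
    for cell_type, weight in zip(chosen_types, proportions):
        k = round(weight * n_cells)
        pool = by_type[cell_type]
        if k > len(pool):
            raise ValueError(f"not enough cells of type {cell_type!r}")
        # Sample without replacement within the cell type.
        picked.extend(rng.sample(pool, k))
    return sorted(picked)
```

Varying `n_types`, `n_cells`, and `proportions` across many seeds would yield a family of datasets in the spirit of the design above; the settings actually used are given in the full text.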

CONCLUSIONS

We identify the strengths and weaknesses of each method on multiple criteria, including the deviation of the estimate from the true number of cell types, the variability of the estimate, the clustering concordance of cells with their predefined cell types, and running time and peak memory usage. We then summarise these results into a multi-aspect recommendation for users. The proposed stability-based approach for estimating the number of cell types is implemented in an R package and is freely available from https://github.com/PYangLab/scCCESS.
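Three of the criteria listed above can be sketched in a few lines. The abstract does not give the exact definitions used, so this sketch makes assumptions: deviation is taken as the signed difference between estimated and true cluster numbers, variability as the standard deviation of the estimates across repeated runs, and concordance as the adjusted Rand index (ARI), a common choice. All function names are hypothetical.

```python
from collections import Counter
from math import comb
from statistics import mean, pstdev


def deviation(estimated_k, true_k):
    """Signed deviation: positive means over-estimation (assumed definition)."""
    return estimated_k - true_k


def summarise_estimates(estimates, true_k):
    """Mean deviation and spread of the estimated k over repeated runs."""
    return {
        "mean_deviation": mean(deviation(k, true_k) for k in estimates),
        "variability": pstdev(estimates),
    }


def adjusted_rand_index(truth, pred):
    """Adjusted Rand index between reference labels and cluster labels."""
    n = len(truth)
    if n < 2:
        return 1.0
    # Pair counts within contingency cells, reference types, and clusters.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions place all cells in one group.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

For example, `summarise_estimates([8, 10, 9, 11], true_k=10)` summarises four repeated runs of one method on a dataset with ten annotated cell types, and `adjusted_rand_index(cell_types, clusters)` scores how well one clustering recovers the predefined labels.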

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c854/8822786/cff95733c4d2/13059_2022_2622_Fig1_HTML.jpg
