Sant Cathrine, Mucke Lennart, Corces M Ryan
Gladstone Institute of Neurological Disease, Gladstone Institutes, San Francisco, CA, USA.
Neuroscience Graduate Program, University of California, San Francisco, San Francisco, CA 94158, USA.
bioRxiv. 2025 Feb 19:2024.01.18.576317. doi: 10.1101/2024.01.18.576317.
Clustering is a critical step in the analysis of single-cell data, as it enables the discovery and characterization of putative cell types and states. However, most popular clustering tools do not subject clustering results to statistical inference testing, leading to risks of overclustering or underclustering data and often resulting in ineffective identification of cell types with widely differing prevalence. To address these challenges, we present CHOIR (clustering hierarchy optimization by iterative random forests), which applies a framework of random forest classifiers and permutation tests across a hierarchical clustering tree to statistically determine which clusters represent distinct populations. We demonstrate the enhanced performance of CHOIR through extensive benchmarking against 14 existing clustering methods across 100 simulated and 4 real single-cell RNA-seq, ATAC-seq, spatial transcriptomic, and multi-omic datasets. CHOIR can be applied to any single-cell data type and provides a flexible, scalable, and robust solution to the important challenge of identifying biologically relevant cell groupings within heterogeneous single-cell data.
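The core idea in the abstract, using a classifier plus a permutation test to decide whether two candidate clusters are distinct populations, can be illustrated with a minimal sketch. This is not CHOIR's implementation. As a stated simplification, a small ensemble of randomized decision stumps scored on out-of-bag cells stands in for a full random forest, and a single observed accuracy is compared against a label-shuffled null. All function names, parameters, and the toy data generator below are hypothetical, and the hierarchical clustering tree is omitted entirely.

```python
import random


def oob_accuracy(X, y, n_stumps, rng):
    """Out-of-bag accuracy of an ensemble of randomized decision stumps.

    Simplified stand-in for a random forest (assumption, not CHOIR's code):
    each stump sees a bootstrap sample and one randomly chosen feature.
    """
    n, d = len(X), len(X[0])
    votes = [[0, 0] for _ in range(n)]  # per-cell votes for label 0 / label 1
    for _ in range(n_stumps):
        boot = [rng.randrange(n) for _ in range(n)]
        in_bag = set(boot)
        f = rng.randrange(d)
        vals = {0: [], 1: []}
        for i in boot:
            vals[y[i]].append(X[i][f])
        if not vals[0] or not vals[1]:
            continue  # bootstrap sample contained a single class; skip stump
        m0 = sum(vals[0]) / len(vals[0])
        m1 = sum(vals[1]) / len(vals[1])
        thr = (m0 + m1) / 2.0          # split halfway between class means
        hi = 1 if m1 >= m0 else 0      # label predicted above the threshold
        for i in range(n):
            if i not in in_bag:        # only out-of-bag cells get a vote
                pred = hi if X[i][f] > thr else 1 - hi
                votes[i][pred] += 1
    # Score only cells with a non-tied majority vote.
    scored = [(v, y[i]) for i, v in enumerate(votes) if v[0] != v[1]]
    if not scored:
        return 0.5
    correct = sum((v[1] > v[0]) == (label == 1) for v, label in scored)
    return correct / len(scored)


def permutation_test(X, y, n_perm=100, n_stumps=25, seed=0):
    """Return (observed accuracy, permutation p-value) for two clusters."""
    rng = random.Random(seed)
    observed = oob_accuracy(X, y, n_stumps, rng)
    exceed = 0
    for _ in range(n_perm):
        shuffled = y[:]
        rng.shuffle(shuffled)          # break the link between cells and labels
        if oob_accuracy(X, shuffled, n_stumps, rng) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (1 + n_perm)


def toy_clusters(shift, n_per=40, d=5, seed=1):
    """Two Gaussian 'clusters' whose feature means differ by `shift` (toy data)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_per)]
    X += [[rng.gauss(shift, 1.0) for _ in range(d)] for _ in range(n_per)]
    y = [0] * n_per + [1] * n_per
    return X, y


if __name__ == "__main__":
    for shift in (4.0, 0.0):
        X, y = toy_clusters(shift)
        acc, p = permutation_test(X, y)
        print(f"shift={shift}: accuracy={acc:.2f}, p={p:.3f}")
```

Under this sketch, a small p-value would argue for keeping the two clusters separate, while a large one would argue for merging them. A procedure of the kind the abstract describes would apply such a decision repeatedly across the branches of a hierarchical clustering tree.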