Sant Cathrine, Mucke Lennart, Corces M Ryan
Gladstone Institute of Neurological Disease, Gladstone Institutes, San Francisco, CA, USA.
Neuroscience Graduate Program, University of California San Francisco, San Francisco, CA, USA.
Nat Genet. 2025 May;57(5):1309-1319. doi: 10.1038/s41588-025-02148-8. Epub 2025 Apr 7.
Clustering is a critical step in the analysis of single-cell data, enabling the discovery and characterization of cell types and states. However, most popular clustering tools do not subject results to statistical inference testing, leading to risks of overclustering or underclustering data and often resulting in ineffective identification of cell types with widely differing prevalence. To address these challenges, we present CHOIR (cluster hierarchy optimization by iterative random forests), which applies a framework of random forest classifiers and permutation tests across a hierarchical clustering tree to statistically determine clusters representing distinct populations. We demonstrate the performance of CHOIR through extensive benchmarking against 15 existing clustering methods across 230 simulated and five real single-cell RNA sequencing, assay for transposase-accessible chromatin sequencing, spatial transcriptomic and multi-omic datasets. CHOIR can be applied to any single-cell data type and provides a flexible, scalable and robust solution to the challenge of identifying biologically relevant cell groupings within heterogeneous single-cell data.
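The core idea described above, deciding whether two candidate clusters are distinct by asking if a classifier separates them better than chance under permuted labels, can be illustrated with a small sketch. This is not the CHOIR implementation. It assumes a simplified setting in which a small ensemble of bootstrapped decision stumps stands in for a full random forest, out-of-bag accuracy serves as the test statistic, and the data are synthetic Gaussian "cells". All function names (`fit_stump`, `oob_accuracy`, `permutation_p_value`, `make_cells`) and parameter values are hypothetical.

```python
import random


def fit_stump(X, y, rng):
    """Fit a one-split tree on one randomly chosen feature.

    The threshold is the midpoint between the two class means.
    Returns None if the bootstrap sample contains only one class.
    """
    j = rng.randrange(len(X[0]))
    vals0 = [x[j] for x, label in zip(X, y) if label == 0]
    vals1 = [x[j] for x, label in zip(X, y) if label == 1]
    if not vals0 or not vals1:
        return None
    mean0 = sum(vals0) / len(vals0)
    mean1 = sum(vals1) / len(vals1)
    high_label = 1 if mean1 >= mean0 else 0  # label predicted above the threshold
    return j, (mean0 + mean1) / 2, high_label


def oob_accuracy(X, y, n_trees, rng):
    """Majority-vote accuracy using only out-of-bag predictions."""
    n = len(X)
    votes = [[0, 0] for _ in range(n)]
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx], rng)
        if stump is None:
            continue
        j, threshold, high_label = stump
        in_bag = set(idx)
        for i in range(n):
            if i not in in_bag:
                pred = high_label if X[i][j] > threshold else 1 - high_label
                votes[i][pred] += 1
    # Cells with tied (or no) out-of-bag votes are left out of the score.
    scored = [(v, label) for v, label in zip(votes, y) if v[0] != v[1]]
    if not scored:
        return 0.5
    correct = sum((1 if v[1] > v[0] else 0) == label for v, label in scored)
    return correct / len(scored)


def permutation_p_value(X, y, n_perm=100, n_trees=25, seed=0):
    """Compare observed accuracy with accuracies under shuffled cluster labels."""
    rng = random.Random(seed)
    observed = oob_accuracy(X, y, n_trees, rng)
    at_least_as_good = 0
    for _ in range(n_perm):
        shuffled = y[:]
        rng.shuffle(shuffled)
        if oob_accuracy(X, shuffled, n_trees, rng) >= observed:
            at_least_as_good += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (at_least_as_good + 1) / (n_perm + 1)


def make_cells(n, n_features, shift, rng):
    """Synthetic 'cells': unit-variance Gaussian features centred at `shift`."""
    return [[rng.gauss(shift, 1.0) for _ in range(n_features)] for _ in range(n)]


data_rng = random.Random(42)
labels = [0] * 30 + [1] * 30

# Case 1: two clearly separated populations.
distinct_X = make_cells(30, 5, 0.0, data_rng) + make_cells(30, 5, 5.0, data_rng)
acc_distinct, p_distinct = permutation_p_value(distinct_X, labels)

# Case 2: one population split arbitrarily into two "clusters".
same_X = make_cells(60, 5, 0.0, data_rng)
acc_same, p_same = permutation_p_value(same_X, labels)

print("distinct populations: accuracy", acc_distinct, "p-value", p_distinct)
print("single population:    accuracy", acc_same, "p-value", p_same)
```

In a hierarchical setting, a test of this kind would be applied to pairs of sibling clusters in the tree, merging pairs whose p-value exceeds a chosen significance level (0.05 is a common choice, assumed here rather than taken from the abstract). A production analysis would use a real random forest implementation and a correction for multiple comparisons.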