Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA.
Division of Biostatistics, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA.
Nat Methods. 2023 Aug;20(8):1196-1202. doi: 10.1038/s41592-023-01933-9. Epub 2023 Jul 10.
Unsupervised clustering of single-cell RNA-sequencing data enables the identification of distinct cell populations. However, the most widely used clustering algorithms are heuristic and do not formally account for statistical uncertainty. We find that not addressing known sources of variability in a statistically rigorous manner can lead to overconfidence in the discovery of novel cell types. Here we extend a previous method, significance of hierarchical clustering, to propose a model-based hypothesis testing approach that incorporates significance analysis into the clustering algorithm and permits statistical evaluation of clusters as distinct cell populations. We also adapt this approach to permit statistical assessment on the clusters reported by any algorithm. Finally, we extend these approaches to account for batch structure. We benchmarked our approach against popular clustering workflows, demonstrating improved performance. To show practical utility, we applied our approach to the Human Lung Cell Atlas and an atlas of the mouse cerebellar cortex, identifying several cases of over-clustering and recapitulating experimentally validated cell type definitions.
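The idea of testing whether a proposed split reflects distinct populations can be illustrated with a toy Monte Carlo test. The sketch below is not the authors' method: it assumes one-dimensional data, a single-Gaussian null model, and a two-means cluster index (within-cluster over total sum of squares) as the test statistic, whereas the abstract does not specify the model or statistic used. The function names `two_means_cluster_index` and `split_p_value` are hypothetical.

```python
import random
import statistics


def two_means_cluster_index(values):
    """Smallest within/total sum-of-squares ratio over all 1-D two-way splits.

    Values near 0 mean the best split explains almost all variance;
    a value of 1.0 is returned when the data have no variance at all.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    tss = total_sq - total * total / n
    if tss == 0:
        return 1.0
    best = float("inf")
    left_sum = left_sq = 0.0
    for k in range(1, n):  # left cluster = xs[:k], right cluster = xs[k:]
        x = xs[k - 1]
        left_sum += x
        left_sq += x * x
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        wss = (left_sq - left_sum ** 2 / k) + (right_sq - right_sum ** 2 / (n - k))
        best = min(best, wss)
    return best / tss


def split_p_value(values, n_sim=200, seed=0):
    """Monte Carlo p-value for 'one population' versus 'two populations'.

    Assumption: the null model is a single Gaussian fitted to the data.
    Null datasets are simulated from it and re-clustered, and the p-value
    is the share of null cluster indices at least as small as the observed one.
    """
    rng = random.Random(seed)
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values)
    observed = two_means_cluster_index(values)
    hits = 0
    for _ in range(n_sim):
        null_sample = [rng.gauss(mu, sigma) for _ in values]
        if two_means_cluster_index(null_sample) <= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)  # add-one correction keeps p strictly positive
```

A small p-value suggests the split is stronger than what clustering a single homogeneous population would typically produce; a large one is consistent with over-clustering. Re-clustering each simulated dataset matters here, because a clustering algorithm will always find some split, even in homogeneous data.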