Yin Yu-Hang, Wang Fang, Li Wei, Liu Qiaoming, Zhou Shengming, Zhou Murong, Jiang Zhongjun, Yu Dong-Jun, Wang Guohua
College of Life Science, Northeast Forestry University, Harbin, 150040, China.
College of Computer and Control Engineering, Northeast Forestry University, Harbin, 150040, China.
Genome Biol. 2025 Sep 3;26(1):265. doi: 10.1186/s13059-025-03719-y.
BACKGROUND: Differences in data distribution, feature dimensions, and quality between single-cell modalities pose challenges for clustering. Although clustering algorithms have been developed for single-cell transcriptomic or proteomic data, their performance across different omics data types and integration scenarios remains poorly investigated, which limits both method selection and future method development. RESULTS: In this study, we conduct a systematic comparative benchmark of 28 computational algorithms on 10 paired transcriptomic and proteomic datasets, evaluating their performance across various metrics in terms of clustering, peak memory, and running time. We also discuss the impact of highly variable genes (HVGs) and cell type granularity on clustering performance. Additionally, the robustness of these clustering methods on the two omics types is evaluated using 30 simulated datasets. Furthermore, to explore the benefits of integrating omics information for clustering tasks, we integrate single-cell transcriptomic and proteomic data using 7 state-of-the-art integration methods and assess the performance of existing single-omics clustering schemes on the integrated features. CONCLUSIONS: Our findings reveal modality-specific strengths and limitations, highlight the complementary nature of existing methods, and provide actionable insights to guide the selection of appropriate clustering approaches for specific scenarios. Overall, for top performance across the two omics types, consider scAIDE, scDCC, and FlowSOM, with FlowSOM also offering excellent robustness. For users prioritizing memory efficiency, scDCC and scDeepCluster are recommended, while TSCAN, SHARP, and MarkovHC are recommended for users who prioritize time efficiency; community detection-based methods offer a balance between the two.