AI Lab, Shenzhen 518054, China.
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, 999077, China.
Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae130.
Cell-type clustering is a crucial first step in single-cell RNA-seq data analysis. However, existing clustering methods often yield different cluster assignments depending on their data pre-processing, choice of distance metric, and feature-extraction strategy, which limits their practical application.
We propose the Cross-Tabulation Ensemble Clustering (CTEC) method, which formulates two re-clustering strategies (distribution-based and outlier-based) via cross-tabulation. Benchmarking experiments on five scRNA-seq datasets show that CTEC offers significant improvements over the individual clustering methods. Moreover, on the two-method ensemble test, CTEC-DB outperforms the state-of-the-art ensemble methods for single-cell data clustering, with improvements of 45.4% over the single-cell aggregated from ensemble clustering method (SAFE) and 17.1% over the single-cell aggregated clustering via Mixture model ensemble method (SAME).
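To make the term concrete, a cross-tabulation of two clustering results is a contingency table that counts how many cells fall into each pair of clusters. The sketch below shows only that table, not the CTEC re-clustering strategies themselves. The function name `cross_tabulate` and the toy labels are hypothetical and are not taken from the CTEC repository.

```python
from collections import Counter


def cross_tabulate(labels_a, labels_b):
    """Count cells in each (cluster from method A, cluster from method B) pair."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must cover the same cells")
    counts = Counter(zip(labels_a, labels_b))
    rows = sorted(set(labels_a))  # cluster ids from method A
    cols = sorted(set(labels_b))  # cluster ids from method B
    # Counter returns 0 for pairs that never occur together.
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


# Hypothetical toy labels for eight cells from two clustering methods.
method_a = [0, 0, 0, 1, 1, 1, 2, 2]
method_b = ["x", "x", "y", "y", "y", "y", "z", "z"]

rows, cols, table = cross_tabulate(method_a, method_b)
print("    ", cols)
for r, row in zip(rows, table):
    print(r, row)
```

In this toy table, off-diagonal counts such as the single cell in cluster 0 of method A and cluster "y" of method B mark cells on which the two methods disagree. An ensemble method would presumably need to reassign such cells.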
The source code of the benchmark in this work is available at the GitHub repository https://github.com/LWCHN/CTEC.git.