School of Computer Science and Engineering, Central South University, Changsha 410083, China.
Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha 410083, China.
Bioinformatics. 2023 Jun 30;39(39 Suppl 1):i368-i376. doi: 10.1093/bioinformatics/btad216.
Single-cell RNA sequencing (scRNA-seq), combined with clustering approaches, offers a powerful tool for dissecting the complexity of biological tissues by identifying cell sub-populations. Feature selection is a critical step for improving the accuracy and interpretability of single-cell clustering. Existing feature selection methods underutilize the discriminatory potential of genes across distinct cell types. We hypothesize that incorporating such information could further boost the performance of single-cell clustering.
We develop CellBRF, a feature selection method for single-cell clustering that considers each gene's relevance to cell types. The key idea is to identify the genes that are most important for discriminating cell types by using random forests guided by predicted cell labels. CellBRF also introduces a class balancing strategy to mitigate the impact of unbalanced cell type distributions on feature importance evaluation. We benchmark CellBRF on 33 scRNA-seq datasets representing diverse biological scenarios and demonstrate that it substantially outperforms state-of-the-art feature selection methods in terms of clustering accuracy and cell neighborhood consistency. Furthermore, we demonstrate the performance of the selected features through three case studies on cell differentiation stage identification, non-malignant cell subtype identification, and rare cell identification. CellBRF provides a new and effective tool to boost single-cell clustering accuracy.
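As a rough illustration of why class balancing matters before feature importance is evaluated, the sketch below resamples cells so that every predicted cluster contributes the same number of rows. It is not CellBRF's actual procedure, which this abstract does not specify: the function name `balance_by_label`, the `target_size` parameter, and the choice of simple random under- and over-sampling with a median-size default are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def balance_by_label(cells, labels, target_size=None, seed=0):
    """Resample cells so each predicted cluster has the same number of rows.

    cells: list of expression vectors (one list of floats per cell).
    labels: predicted cluster label for each cell (must be sortable).
    target_size: rows to keep per cluster; defaults to the median
        cluster size (an illustrative assumption, not from the paper).
    """
    rng = random.Random(seed)

    # Group cell indices by their predicted label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    if target_size is None:
        sizes = sorted(len(idxs) for idxs in by_label.values())
        target_size = sizes[len(sizes) // 2]

    balanced_cells, balanced_labels = [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        if len(idxs) >= target_size:
            # Large cluster: undersample without replacement.
            chosen = rng.sample(idxs, target_size)
        else:
            # Small cluster: keep every cell, then oversample with replacement.
            chosen = idxs + rng.choices(idxs, k=target_size - len(idxs))
        for i in chosen:
            balanced_cells.append(cells[i])
            balanced_labels.append(label)
    return balanced_cells, balanced_labels


# Toy data: 12 cells, 2 genes, three clusters of very different sizes.
cells = [[float(i), float(i % 3)] for i in range(12)]
labels = ["A"] * 8 + ["B"] * 3 + ["C"] * 1
balanced_cells, balanced_labels = balance_by_label(cells, labels, target_size=4)
```

After balancing, a classifier such as a random forest trained on `balanced_cells` and `balanced_labels` would see each predicted cluster equally often, so genes that mark a rare cluster are less likely to be outweighed by genes that mark an abundant one.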
All source code for CellBRF is freely available at https://github.com/xuyp-csu/CellBRF.