Hua Heyang, Long Wenxin, Pan Yan, Li Siyu, Zhou Jianyu, Wang Haixin, Chen Shengquan
School of Mathematical Sciences and LPMC, Nankai University, Tianjin, 300071, China.
Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, 100084, China.
Interdiscip Sci. 2025 Mar;17(1):12-26. doi: 10.1007/s12539-024-00655-6. Epub 2024 Sep 30.
Cancer is a major global public health concern, and early detection can greatly improve curative outcomes. Identifying cancer cells is therefore of central importance as the primary method for cancer diagnosis. Advances in single-cell RNA sequencing (scRNA-seq) technology have made it possible to address cancer cell identification at the single-cell level more efficiently with computational methods than with manual identification, which is time-consuming and less reproducible. However, existing computational methods show suboptimal identification performance and cannot incorporate external reference data as prior information. Here, we propose scCrab, a reference-guided automatic cancer cell identification method that performs ensemble learning based on a Bayesian neural network (BNN) with multi-head self-attention mechanisms and a linear regression model. Through a series of experiments on various datasets, we systematically validated the superior performance of scCrab in both intra- and inter-dataset predictions. We also demonstrated the robustness of scCrab to dropout rate and sample size, and conducted ablation experiments to investigate the contribution of each component of scCrab. Furthermore, as a dedicated model for cancer cell identification, scCrab effectively captures cancer-related biological signals during the identification process.