Ceccarelli Francesco, Liò Pietro, Holden Sean B
Department of Computer Science and Technology, University of Cambridge, 15 JJ Thomson Ave, CB3 0FD, Cambridge, UK.
NAR Genom Bioinform. 2024 Dec 4;6(4):lqae166. doi: 10.1093/nargab/lqae166. eCollection 2024 Dec.
The identification of cell types in single-cell RNA sequencing (scRNA-seq) data is a critical task in understanding complex biological systems. Traditional supervised machine learning methods rely on large, well-labeled datasets, which are often impractical to obtain in open-world scenarios due to budget constraints and incomplete information. To address these challenges, we propose a novel computational framework, named AnnoGCD, building on Generalized Category Discovery (GCD) and Anomaly Detection (AD) for automatic cell type annotation. Our semi-supervised method combines labeled and unlabeled data to accurately classify known cell types and to discover novel ones, even in imbalanced datasets. AnnoGCD includes a semi-supervised block to first classify known cell types, followed by an unsupervised block aimed at identifying and clustering novel cell types. We evaluated our approach on five human scRNA-seq datasets and a mouse model atlas, demonstrating superior performance in both known and novel cell type identification compared to existing methods. Our model also exhibited robustness in datasets with significant class imbalance. The results suggest that AnnoGCD is a powerful tool for the automatic annotation of cell types in scRNA-seq data, providing a scalable solution for biological research and clinical applications. Our code and the datasets used for evaluations are publicly available on GitHub: https://github.com/cecca46/AnnoGCD/.
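The two-stage flow described in the abstract (label known cell types first, then cluster whatever is left as candidate novel types) can be illustrated with a toy sketch. This is not the AnnoGCD implementation. It assumes stand-in components chosen only for brevity: a nearest-centroid rule in place of the semi-supervised block, a distance threshold in place of anomaly detection, and a plain k-means in place of the unsupervised block. The function names, the `margin` parameter, the 2-D toy "expression" vectors, and the assumption that the number of novel types `k` is known in advance are all hypothetical.

```python
import math
import random

def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def fit_known(X_lab, y_lab):
    """Stage 1 stand-in: one centroid per known label, plus a rejection radius."""
    groups = {}
    for x, y in zip(X_lab, y_lab):
        groups.setdefault(y, []).append(x)
    centroids = {y: centroid(pts) for y, pts in groups.items()}
    # Radius = largest distance from any labeled cell to its own centroid.
    radius = max(math.dist(x, centroids[y]) for x, y in zip(X_lab, y_lab))
    return centroids, radius

def annotate(X_unl, centroids, radius, margin=1.5):
    """Assign a known label, or None when the cell is too far from every centroid."""
    labels, rejected = [], []
    for i, x in enumerate(X_unl):
        y, d = min(((y, math.dist(x, c)) for y, c in centroids.items()),
                   key=lambda t: t[1])
        if d <= margin * radius:
            labels.append(y)
        else:
            labels.append(None)   # flagged as a candidate novel cell
            rejected.append(i)
    return labels, rejected

def kmeans(points, k, iters=20, seed=0):
    """Stage 2 stand-in: basic k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster goes empty
                centers[j] = centroid(members)
    return assign

# Toy data: two known types and two unseen groups (values are made up).
X_lab = [[0, 0], [0.2, 0.1], [-0.1, 0.3], [5, 5], [5.2, 4.9], [4.8, 5.1]]
y_lab = ["T cell"] * 3 + ["B cell"] * 3
X_unl = [[0.1, 0.2], [5.1, 5.0],
         [10, 0], [10.2, 0.1], [9.9, -0.2],
         [0, 10], [0.1, 10.2]]

centroids, radius = fit_known(X_lab, y_lab)
labels, rejected = annotate(X_unl, centroids, radius)
novel_ids = kmeans([X_unl[i] for i in rejected], k=2)
```

In this toy run, the first two unlabeled cells receive known labels, and the remaining five are rejected and split into two groups. A real pipeline would replace each stand-in with a learned model and would need a way to estimate the number of novel types.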