scGAD: a new task and end-to-end framework for generalized cell type annotation and discovery.

Affiliations

School of Mathematical Sciences, Peking University, Beijing, China.

Huawei Technologies Co., Ltd., Beijing, China.

Publication information

Brief Bioinform. 2023 Mar 19;24(2). doi: 10.1093/bib/bbad045.

DOI:10.1093/bib/bbad045

PMID:36869836

Abstract

The rapid development of single-cell RNA sequencing (scRNA-seq) technology allows us to study gene expression heterogeneity at the cellular level. Cell annotation is the basis for subsequent downstream analysis in single-cell data mining. As more and more well-annotated scRNA-seq reference data become available, many automatic annotation methods have sprung up in order to simplify the cell annotation process on unlabeled target data. However, existing methods rarely explore the fine-grained semantic knowledge of novel cell types absent from the reference data, and they are usually susceptible to batch effects on the classification of seen cell types. Taking into consideration the limitations above, this paper proposes a new and practical task called generalized cell type annotation and discovery for scRNA-seq data whereby target cells are labeled with either seen cell types or cluster labels, instead of a unified 'unassigned' label. To accomplish this, we carefully design a comprehensive evaluation benchmark and propose a novel end-to-end algorithmic framework called scGAD. Specifically, scGAD first builds the intrinsic correspondences on seen and novel cell types by retrieving geometrically and semantically mutual nearest neighbors as anchor pairs. Together with the similarity affinity score, a soft anchor-based self-supervised learning module is then designed to transfer the known label information from reference data to target data and aggregate the new semantic knowledge within target data in the prediction space. To enhance the inter-type separation and intra-type compactness, we further propose a confidential prototype self-supervised learning paradigm to implicitly capture the global topological structure of cells in the embedding space. Such a bidirectional dual alignment mechanism between embedding space and prediction space can better handle batch effect and cell type shift. 
Extensive results on massive simulated datasets and real datasets demonstrate the superiority of scGAD over various state-of-the-art clustering and annotation methods. We also implement marker gene identification to validate the effectiveness of scGAD in clustering novel cell types and their biological significance. To the best of our knowledge, we are the first to introduce this new and practical task and propose an end-to-end algorithmic framework to solve it. Our method scGAD is implemented in Python using the PyTorch machine-learning library, and it is freely available at https://github.com/aimeeyaoyao/scGAD.
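
The anchor-pair idea in the abstract can be illustrated with a minimal sketch of mutual nearest neighbor retrieval between reference and target cells. This is not the scGAD implementation: the function name `mutual_nearest_neighbors`, the parameter `k`, and the choice of cosine similarity are assumptions made for illustration, and the sketch works in a single feature space, whereas the abstract describes anchors that are mutual neighbors both geometrically and semantically.

```python
import numpy as np


def mutual_nearest_neighbors(ref, tgt, k=5):
    """Find reference/target pairs that are in each other's k nearest neighbors.

    Hypothetical helper for illustration only (not part of the scGAD code base).
    ref: array of shape (n_ref, d), one row per reference cell.
    tgt: array of shape (n_tgt, d), one row per target cell.
    Returns (pairs, scores): pairs has shape (n_pairs, 2) with columns
    (reference index, target index); scores holds the cosine similarity
    of each pair.
    """
    ref = np.asarray(ref, dtype=float)
    tgt = np.asarray(tgt, dtype=float)

    # L2-normalize rows so that a dot product equals cosine similarity.
    ref_n = ref / np.maximum(np.linalg.norm(ref, axis=1, keepdims=True), 1e-12)
    tgt_n = tgt / np.maximum(np.linalg.norm(tgt, axis=1, keepdims=True), 1e-12)
    sim = ref_n @ tgt_n.T  # shape (n_ref, n_tgt)

    n_ref, n_tgt = sim.shape
    k_tgt = min(k, n_tgt)
    k_ref = min(k, n_ref)

    # For each reference cell, the indices of its k most similar target cells.
    ref_to_tgt = np.argsort(-sim, axis=1)[:, :k_tgt]
    # For each target cell, the indices of its k most similar reference cells.
    tgt_to_ref = np.argsort(-sim.T, axis=1)[:, :k_ref]

    # Boolean masks marking neighbor relations in each direction.
    ref_picks = np.zeros((n_ref, n_tgt), dtype=bool)
    ref_picks[np.arange(n_ref)[:, None], ref_to_tgt] = True
    tgt_picks = np.zeros((n_ref, n_tgt), dtype=bool)
    tgt_picks[tgt_to_ref, np.arange(n_tgt)[:, None]] = True

    # A pair is an anchor only if the neighbor relation holds both ways.
    mutual = ref_picks & tgt_picks
    pairs = np.argwhere(mutual)
    scores = sim[mutual]
    return pairs, scores
```

Calling `mutual_nearest_neighbors(ref_embeddings, tgt_embeddings, k=5)` on two embedding matrices returns the index pairs together with their similarities. In a soft anchor scheme such as the one the abstract describes, a similarity of this kind could serve as a per-pair weight, although the exact affinity score used by scGAD is not specified here.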
