scEVOLVE: cell-type incremental annotation without forgetting for single-cell RNA-seq data.

Affiliations

School of Mathematical Sciences, Peking University, Beijing, China.

Huawei Technologies Co., Ltd., Beijing, China.

Publication information

Brief Bioinform. 2024 Jan 22;25(2). doi: 10.1093/bib/bbae039.

DOI:10.1093/bib/bbae039

PMID:38366803

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10939389/

Abstract

The evolution of single-cell RNA sequencing (scRNA-seq) technology has opened a new avenue for researchers to inspect cellular heterogeneity with single-cell precision. One crucial aspect of this technology is cell-type annotation, which is fundamental for any subsequent analysis in single-cell data mining. Recently, the scientific community has seen a surge in the development of automatic annotation methods aimed at this task. However, these methods generally operate with a fixed total number of cell types, which significantly restricts the annotation system's capacity for continuous knowledge acquisition. Furthermore, creating a unified scRNA-seq annotation system remains challenging because the system must progressively expand its understanding of an ever-increasing set of cell-type concepts derived from a continuous data stream. In response to these challenges, this paper presents a novel and challenging setting for annotation, namely cell-type incremental annotation. This setting is designed to continually enhance cell-type knowledge gleaned from continuously incoming data. The difficulty of the task is that samples in the data stream can only be observed once, which leads to catastrophic forgetting. To address this problem, we introduce scEVOLVE, an incremental annotation method. The approach is built upon contrastive sample replay combined with the principle of partition confidence maximization. Specifically, we first retain and replay sections of the old data in each subsequent training phase, then establish a prototypical learning objective, as an alternative to cross-entropy, to mitigate the cell-type imbalance problem. To effectively emulate a model that trains concurrently with complete data, we introduce a cell-type decorrelation strategy that scatters the feature representations of each cell type uniformly.
We constructed the scEVOLVE framework with simplicity and ease of integration into most deep softmax-based single-cell annotation methods. Thorough experiments conducted on a range of meticulously constructed benchmarks consistently show that our methodology can incrementally learn numerous cell types over an extended period, outperforming other strategies that fail quickly. To our knowledge, this is the first attempt to propose and formulate an end-to-end algorithm framework to address this new, practical task. Additionally, scEVOLVE, coded in Python using the PyTorch machine-learning library, is freely accessible at https://github.com/aimeeyaoyao/scEVOLVE.
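
The replay idea described in the abstract can be illustrated with a small, hedged sketch. This is not the scEVOLVE implementation: it swaps the contrastive, deep-network training for a plain nearest-prototype classifier on raw feature vectors, and it assumes random exemplar selection. The names `ReplayBuffer`, `add_stage`, `prototypes`, and `annotate`, and the toy data, are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


class ReplayBuffer:
    """Keeps at most `per_type` exemplars for every cell type seen so far.

    Hypothetical helper: a stand-in for "retain and replay sections of the
    old data", with random exemplar selection as an assumption.
    """

    def __init__(self, per_type, seed=0):
        self.per_type = per_type
        self.rng = random.Random(seed)
        self.store = defaultdict(list)

    def add_stage(self, cells, labels):
        """Absorb one stage of the data stream, then shrink to the budget."""
        grouped = defaultdict(list)
        for x, y in zip(cells, labels):
            grouped[y].append(x)
        for y, xs in grouped.items():
            pool = self.store[y] + xs
            self.rng.shuffle(pool)
            self.store[y] = pool[: self.per_type]

    def prototypes(self):
        """Mean feature vector of the retained exemplars of each cell type."""
        protos = {}
        for y, xs in self.store.items():
            dim = len(xs[0])
            protos[y] = [sum(x[d] for x in xs) / len(xs) for d in range(dim)]
        return protos


def annotate(cell, prototypes):
    """Assign the cell type whose prototype is closest in squared distance."""
    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(prototypes, key=lambda y: sqdist(cell, prototypes[y]))


# Stage 1 introduces two cell types; stage 2 introduces a third one.
buffer = ReplayBuffer(per_type=2)
buffer.add_stage([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1]],
                 ["T cell", "T cell", "B cell"])
buffer.add_stage([[9.0, 0.0], [9.2, 0.1]], ["NK cell", "NK cell"])

protos = buffer.prototypes()
old_type = annotate([0.05, 0.05], protos)  # a type from stage 1
new_type = annotate([9.1, 0.0], protos)    # a type from stage 2
```

Because every cell type keeps the same exemplar budget, each prototype is computed from a comparable number of cells, which loosely mirrors the motivation given in the abstract for using a prototypical objective against cell-type imbalance. The decorrelation strategy is not covered by this sketch.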
