Zhang Ruiyi, Luo Yunan, Ma Jianzhu, Zhang Ming, Wang Sheng
School of EECS, Peking University, Beijing, China.
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.
Bioinformatics. 2022 Mar 4;38(6):1607-1614. doi: 10.1093/bioinformatics/btac007.
Rapidly generated scRNA-seq datasets enable us to understand cellular differences and the function of each individual cell at single-cell resolution. Cell-type classification, which aims at characterizing and labeling groups of cells according to their gene expression, is one of the most important steps in single-cell analysis. To facilitate the manual curation process, supervised learning methods have been used to automatically classify cells. Most existing supervised learning approaches only utilize annotated cells in the training step while ignoring the more abundant unannotated cells. In this article, we propose scPretrain, a multi-task self-supervised learning approach that jointly considers annotated and unannotated cells for cell-type classification. scPretrain consists of a pre-training step and a fine-tuning step. In the pre-training step, scPretrain uses a multi-task learning framework to train a feature extraction encoder based on each dataset's pseudo-labels, where only unannotated cells are used. In the fine-tuning step, scPretrain fine-tunes this feature extraction encoder using the limited annotated cells in a new dataset.
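The pre-train/fine-tune workflow described above can be sketched in a few lines. This is a minimal toy illustration, not the authors' implementation: it assumes pseudo-labels come from per-dataset k-means clustering, uses PCA as a stand-in for the neural feature-extraction encoder, and uses per-dataset logistic-regression heads as the multi-task objective; all data here is random.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-ins for three unannotated scRNA-seq datasets (cells x genes).
unannotated = [rng.normal(size=(200, 50)) for _ in range(3)]

# --- Pre-training step ---
# Derive pseudo-labels for each dataset by clustering its cells;
# no annotations are needed (k-means is one plausible choice).
pseudo_labels = [
    KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    for X in unannotated
]

# A shared "encoder" fit jointly on all unannotated cells.
# PCA is a placeholder for the neural encoder used in the paper.
encoder = PCA(n_components=16, random_state=0).fit(np.vstack(unannotated))

# Per-dataset heads trained to predict that dataset's pseudo-labels from
# the shared representation -- the multi-task objective.
heads = [
    LogisticRegression(max_iter=1000).fit(encoder.transform(X), y)
    for X, y in zip(unannotated, pseudo_labels)
]

# --- Fine-tuning step ---
# A new dataset with only a handful of annotated cells (toy labels).
X_new = rng.normal(size=(40, 50))
y_new = rng.integers(0, 3, size=40)

# Reuse the pre-trained encoder and fit a fresh classifier head on the
# limited annotations (in the paper the encoder weights are updated too).
clf = LogisticRegression(max_iter=1000).fit(encoder.transform(X_new), y_new)
preds = clf.predict(encoder.transform(X_new))
```

The key design point the sketch preserves is that the encoder never sees cell-type annotations during pre-training; supervision there comes entirely from dataset-specific pseudo-labels.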
We evaluated scPretrain on 60 diverse datasets spanning different technologies, species and organs, and observed significant improvements in both cell-type classification and cell clustering. Moreover, the representations obtained by scPretrain in the pre-training step also enhanced the performance of conventional classifiers, such as random forest, logistic regression and support vector machines. scPretrain effectively utilizes the large amount of unlabeled data and can be applied to annotate the rapidly growing number of scRNA-seq datasets.
The data and code underlying this article are available at https://github.com/ruiyi-zhang/scPretrain and https://zenodo.org/record/5802306.
Supplementary data are available at Bioinformatics online.