
scPretrain: multi-task self-supervised learning for cell-type classification.

Authors

Zhang Ruiyi, Luo Yunan, Ma Jianzhu, Zhang Ming, Wang Sheng

Affiliations

School of EECS, Peking University, Beijing, China.

Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.

Publication

Bioinformatics. 2022 Mar 4;38(6):1607-1614. doi: 10.1093/bioinformatics/btac007.

Abstract

MOTIVATION

Rapidly generated scRNA-seq datasets enable us to understand cellular differences and the function of each individual cell at single-cell resolution. Cell-type classification, which aims at characterizing and labeling groups of cells according to their gene expression, is one of the most important steps for single-cell analysis. To facilitate the manual curation process, supervised learning methods have been used to automatically classify cells. Most of the existing supervised learning approaches only utilize annotated cells in the training step while ignoring the more abundant unannotated cells. In this article, we propose scPretrain, a multi-task self-supervised learning approach that jointly considers annotated and unannotated cells for cell-type classification. scPretrain consists of a pre-training step and a fine-tuning step. In the pre-training step, scPretrain uses a multi-task learning framework to train a feature extraction encoder based on each dataset's pseudo-labels, where only unannotated cells are used. In the fine-tuning step, scPretrain fine-tunes this feature extraction encoder using the limited annotated cells in a new dataset.
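The two-step recipe above can be sketched in toy form. The snippet below is an illustrative NumPy sketch, not the authors' implementation: it assumes k-means cluster assignments as the pseudo-labels, a one-layer ReLU encoder shared across datasets with one softmax head per dataset (the multi-task part), and a frozen-encoder linear probe standing in for fine-tuning; the real scPretrain uses a deeper encoder and the specific design choices described in the paper, and all names here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, k, dim = 50, 4, 16  # genes per cell, pseudo-classes, embedding size

def make_dataset(n_per_cluster=50):
    """Toy scRNA-seq-like matrix (cells x genes) drawn from k Gaussian blobs."""
    blobs = [rng.normal(loc=rng.normal(scale=3.0, size=n_genes),
                        size=(n_per_cluster, n_genes)) for _ in range(k)]
    return np.vstack(blobs)

def kmeans(X, k, iters=25):
    """Plain k-means; the cluster assignments serve as pseudo-labels."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

def softmax(z):
    z = z - z.max(1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(1, keepdims=True)

# --- pre-training: unannotated datasets, one pseudo-label task each ---
datasets = [make_dataset() for _ in range(3)]
pseudo = [kmeans(X, k) for X in datasets]

W = rng.normal(scale=0.1, size=(n_genes, dim))                    # shared encoder
heads = [rng.normal(scale=0.1, size=(dim, k)) for _ in datasets]  # per-task heads

lr = 0.05
for _ in range(200):
    for t, (X, y) in enumerate(zip(datasets, pseudo)):
        H = np.maximum(X @ W, 0)                  # ReLU encoder output
        G = softmax(H @ heads[t])
        G[np.arange(len(y)), y] -= 1              # d(cross-entropy)/d(logits)
        G /= len(y)
        dW = X.T @ ((G @ heads[t].T) * (H > 0))   # grads before the updates
        heads[t] -= lr * H.T @ G
        W -= lr * dW

pre_acc = (softmax(np.maximum(datasets[0] @ W, 0) @ heads[0]).argmax(1)
           == pseudo[0]).mean()

# --- fine-tuning: a new dataset with (here: simulated) annotations ---
X_new = make_dataset(25)
y_new = kmeans(X_new, k)            # stand-in for real curator annotations
head = rng.normal(scale=0.1, size=(dim, k))
for _ in range(300):                # encoder frozen here for brevity;
    H = np.maximum(X_new @ W, 0)    # scPretrain also updates the encoder
    G = (softmax(H @ head) - np.eye(k)[y_new]) / len(y_new)
    head -= 0.1 * H.T @ G
acc = (softmax(np.maximum(X_new @ W, 0) @ head).argmax(1) == y_new).mean()
print(f"pre-train acc: {pre_acc:.2f}, fine-tune acc: {acc:.2f}")
```

The point of the sketch is the transfer pattern: only the shared encoder `W` carries over from the unannotated corpora to the new dataset, while each pseudo-label task and the final annotated task get their own classification head.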

RESULTS

We evaluated scPretrain on 60 diverse datasets from different technologies, species and organs, and obtained a significant improvement on both cell-type classification and cell clustering. Moreover, the representations obtained by scPretrain in the pre-training step also enhanced the performance of conventional classifiers, such as random forest, logistic regression and support-vector machines. scPretrain effectively utilizes the massive amount of unlabeled data and can be applied to annotate the rapidly growing number of scRNA-seq datasets.

AVAILABILITY AND IMPLEMENTATION

The data and code underlying this article are available in scPretrain: Multi-task self-supervised learning for cell type classification, at https://github.com/ruiyi-zhang/scPretrain and https://zenodo.org/record/5802306.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

