HDMC: a novel deep learning-based framework for removing batch effects in single-cell RNA-seq data.

Author information

College of Computer Science, Nankai University, 300350 Tianjin, China.

Tianjin Key Laboratory of Network and Data Security Technology, Nankai University, 300350 Tianjin, China.

Publication information

Bioinformatics. 2022 Feb 7;38(5):1295-1303. doi: 10.1093/bioinformatics/btab821.

DOI:10.1093/bioinformatics/btab821

PMID:34864918

Abstract

MOTIVATION

With the development of single-cell RNA sequencing (scRNA-seq) techniques, an increasing number of large-scale gene expression datasets have become available. However, to analyze datasets produced by different experiments, batch effects among the datasets must be considered. Although several methods have recently been published to remove batch effects in scRNA-seq data, two problems remain challenging and are not completely solved: (i) how to reduce the distribution differences between batches more accurately; and (ii) how to align samples from different batches to recover the cell type clusters.

RESULTS

We propose a novel deep learning approach, a hierarchical distribution-matching framework assisted by contrastive learning, to address these two problems. First, we design a hierarchical framework for distribution matching based on a deep autoencoder. The framework employs an adversarial training strategy to match the global distributions of different batches, which provides an improved foundation for then matching the local distributions with a maximum mean discrepancy-based loss. For local matching, we divide the cells in each batch into clusters and develop a contrastive learning mechanism that simultaneously aligns similar cluster pairs and keeps noisy pairs apart. This makes it possible to obtain clusters in which all cells are of the same type (true positives) and to avoid clusters containing cells of different types (false positives). We demonstrate the effectiveness of our method on both simulated and real datasets. The results show that our method significantly outperforms state-of-the-art methods and is able to prevent overcorrection.
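
As an illustration of the maximum mean discrepancy (MMD) that the local matching loss builds on, the following minimal sketch estimates the squared MMD between two sets of embedded cells. It is not the HDMC implementation: the Gaussian kernel, the bandwidth value, the synthetic data and all function names (`gaussian_kernel`, `mean_kernel`, `mmd2`, `sample_batch`) are assumptions made for this example, and only the Python standard library is used.

```python
import math
import random


def gaussian_kernel(u, v, bandwidth):
    """Gaussian (RBF) kernel between two vectors (kernel choice is an assumption)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(u, v, bandwidth) for u in xs for v in ys)
    return total / (len(xs) * len(ys))


def mmd2(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between samples xs and ys."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )


def sample_batch(rng, n_cells, n_dims, shift):
    """Synthetic stand-in for embedded cells: unit-variance Gaussian with a mean shift."""
    return [[rng.gauss(shift, 1.0) for _ in range(n_dims)] for _ in range(n_cells)]


rng = random.Random(0)
batch_a = sample_batch(rng, 100, 4, 0.0)  # reference batch
batch_b = sample_batch(rng, 100, 4, 0.0)  # same distribution as batch_a
batch_c = sample_batch(rng, 100, 4, 1.5)  # shifted mean, mimicking a batch effect

mmd_matched = mmd2(batch_a, batch_b, bandwidth=2.0)
mmd_shifted = mmd2(batch_a, batch_c, bandwidth=2.0)
print(f"matched batches: {mmd_matched:.4f}")
print(f"shifted batch:   {mmd_shifted:.4f}")
```

In this toy setting the shifted batch yields a clearly larger value than the matched one, which is the property that makes an MMD-style term usable as a loss: minimizing it pulls the two distributions together.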

AVAILABILITY AND IMPLEMENTATION

The Python code to generate the results and figures in this article is available at https://github.com/zhanglabNKU/HDMC. The data underlying this article are also available in this GitHub repository.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
