Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.
J Comput Biol. 2021 May;28(5):501-513. doi: 10.1089/cmb.2020.0439. Epub 2021 Jan 19.
Dimensionality reduction is an important first step in the analysis of single-cell RNA-sequencing (scRNA-seq) data. In addition to enabling visualization of the profiled cells, reduced representations are used by many downstream analysis methods, ranging from pseudo-time reconstruction to clustering to alignment of scRNA-seq data from different experiments, platforms, and laboratories. Both supervised and unsupervised methods have been proposed to reduce the dimension of scRNA-seq data. However, all methods to date are sensitive to batch effects. When batches correlate with cell types, as is often the case, their impact can lead to representations that are batch specific rather than cell-type specific. To overcome this, we developed a domain adversarial neural network model for learning a reduced dimension representation of scRNA-seq data. The adversarial model tries to optimize two objectives simultaneously: the first is the accuracy of cell-type assignment, and the second is the inability to distinguish the batch (domain). We tested the method by using the resulting representation to align several different data sets. As we show, by overcoming batch effects our method was able to correctly separate cell types, improving on several prior methods suggested for this task. Analysis of the top features used by the network indicates that, by taking the batch impact into account, the reduced representation is much better able to focus on key genes for each cell type.
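The two competing objectives described above can be illustrated with a small numerical sketch. The abstract does not give the loss function, so everything below is an assumption made for illustration: cross-entropy for both classifiers, a hypothetical trade-off weight `lam`, and the combined form "cell-type loss minus `lam` times batch loss" that is common in domain adversarial training. The function names are likewise hypothetical and are not taken from the paper.

```python
import math


def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true labels.

    probs  -- list of per-cell probability vectors (one entry per class)
    labels -- list of integer class indices, one per cell
    """
    eps = 1e-12  # guards against log(0)
    return -sum(math.log(p[y] + eps) for p, y in zip(probs, labels)) / len(labels)


def adversarial_objective(cell_type_probs, cell_types, batch_probs, batches, lam=1.0):
    """Quantity the shared representation would try to minimize (assumed form).

    A low cell-type loss rewards accurate cell-type assignment; a high batch
    loss means the batch classifier cannot tell the batches apart, so it is
    subtracted. The batch classifier itself would minimize batch_loss.
    """
    label_loss = cross_entropy(cell_type_probs, cell_types)
    batch_loss = cross_entropy(batch_probs, batches)
    return label_loss - lam * batch_loss, label_loss, batch_loss


# Toy example: two cells, two cell types, two batches (made-up numbers).
cell_type_probs = [[0.9, 0.1], [0.2, 0.8]]
cell_types = [0, 1]
batches = [0, 1]

# Batch classifier at chance: the representation carries no batch signal.
mixed, _, _ = adversarial_objective(
    cell_type_probs, cell_types, [[0.5, 0.5], [0.5, 0.5]], batches)

# Batch classifier confident: the representation leaks batch identity.
leaky, _, _ = adversarial_objective(
    cell_type_probs, cell_types, [[0.99, 0.01], [0.01, 0.99]], batches)

print(mixed < leaky)  # the batch-invariant representation scores better
```

With identical cell-type predictions, the representation whose batch classifier is reduced to chance receives the lower (better) objective value, which is the behavior the adversarial setup is meant to encourage. In a trained network this trade-off is usually implemented with a gradient reversal step between the shared layers and the batch classifier; whether the authors do so is not stated in the abstract.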