Department of Biostatistics, Yale School of Public Health, Yale University, New Haven, CT 06520, USA.
Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
Bioinformatics. 2022 Aug 10;38(16):3942-3949. doi: 10.1093/bioinformatics/btac427.
With the advancement of technology, we can generate and access large-scale, high-dimensional and diverse genomics data, especially through single-cell RNA sequencing (scRNA-seq). However, integrative downstream analysis of multiple scRNA-seq datasets remains challenging due to batch effects.
In this article, we propose a light-structured deep learning framework called ResPAN for scRNA-seq data integration. ResPAN is based on a Wasserstein Generative Adversarial Network (WGAN) combined with random walk mutual nearest neighbor pairing and fully skip-connected autoencoders to reduce the differences among batches. We also discuss the limitations of existing methods and demonstrate the advantages of our model over seven other methods through extensive benchmarking studies on both simulated data under various scenarios and real datasets of different scales. Our model achieves leading performance on both batch correction and biological information conservation, and remains scalable to datasets with over half a million cells.
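To make the pairing idea concrete, the following is a minimal sketch of plain mutual nearest neighbor (MNN) pairing between two batches, written with NumPy only. It is not the ResPAN implementation: the function name `mutual_nearest_neighbors`, the parameter `k`, the Euclidean metric and the toy data are all assumptions made for illustration, and the random walk extension, the WGAN and the skip-connected autoencoder described in the abstract are deliberately left out.

```python
import numpy as np


def mutual_nearest_neighbors(batch_a, batch_b, k=3):
    """Return (i, j) pairs where cell i of batch_a and cell j of batch_b
    are each among the other's k nearest neighbors (Euclidean distance).

    Hypothetical helper for illustration; not part of the ResPAN package.
    """
    # Pairwise squared Euclidean distances, shape (n_a, n_b).
    diff = batch_a[:, None, :] - batch_b[None, :, :]
    dist = np.sum(diff ** 2, axis=2)
    n_a, n_b = dist.shape

    # Never ask for more neighbors than the other batch has cells.
    k_ab = min(k, n_b)
    k_ba = min(k, n_a)

    # For each cell in A, the indices of its k nearest cells in B.
    nn_a_to_b = np.argsort(dist, axis=1)[:, :k_ab]
    # For each cell in B, the indices of its k nearest cells in A.
    nn_b_to_a = np.argsort(dist.T, axis=1)[:, :k_ba]

    # Boolean masks over the (n_a, n_b) grid marking each direction.
    a_to_b = np.zeros((n_a, n_b), dtype=bool)
    a_to_b[np.arange(n_a)[:, None], nn_a_to_b] = True
    b_to_a = np.zeros((n_a, n_b), dtype=bool)
    b_to_a[nn_b_to_a, np.arange(n_b)[:, None]] = True

    # A pair is "mutual" only if it is marked in both directions.
    mutual = a_to_b & b_to_a
    return [(int(i), int(j)) for i, j in np.argwhere(mutual)]


# Toy example: two cells in batch A, three in batch B (2 features each).
batch_a = np.array([[0.0, 0.0], [10.0, 10.0]])
batch_b = np.array([[0.5, 0.0], [10.0, 10.5], [100.0, 100.0]])
pairs = mutual_nearest_neighbors(batch_a, batch_b, k=1)
print(pairs)
```

In this toy case the third cell of batch B has a nearest neighbor in batch A, but no cell of batch A chooses it back, so it stays unpaired. Pairs of this kind are commonly used as anchors that tell a correction model which cells in different batches should be brought together.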
An open-source implementation of ResPAN and scripts to reproduce the results can be downloaded from: https://github.com/AprilYuge/ResPAN.
Supplementary data are available at Bioinformatics online.