MOE Key Laboratory of Bioinformatics, Beijing Advanced Innovation Center for Structural Biology & Frontier Research Center for Biological Structure, Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing, 100084, China.
Tsinghua-Peking Center for Life Sciences, Beijing, 100084, China.
Nat Commun. 2022 Oct 17;13(1):6118. doi: 10.1038/s41467-022-33758-z.
Computational tools for integrative analyses of diverse single-cell experiments are facing formidable new challenges, including dramatic increases in data scale, sample heterogeneity, and the need to informatively cross-reference new data with foundational datasets. Here, we present SCALEX, a deep-learning method that integrates single-cell data by projecting cells into a batch-invariant, common cell-embedding space in a truly online manner (i.e., without retraining the model). SCALEX substantially outperforms online iNMF and other state-of-the-art non-online integration methods on benchmark single-cell datasets of diverse modalities (e.g., single-cell RNA sequencing, scRNA-seq, and single-cell assay for transposase-accessible chromatin using sequencing, scATAC-seq), especially for datasets with partial overlaps, accurately aligning similar cell populations while retaining true biological differences. We showcase SCALEX's advantages by constructing continuously expandable single-cell atlases for human, mouse, and COVID-19 patients, each assembled from diverse data sources and growing with each new dataset. Its online data integration capacity and superior performance make SCALEX particularly appropriate for large-scale single-cell applications that build upon previous scientific insights.
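The phrase "online manner (i.e., without retraining the model)" can be illustrated with a minimal sketch. It is not SCALEX's architecture: a plain linear encoder (PCA via SVD) stands in for the deep-learning model, the data are random, and the names `fit_encoder` and `project` are hypothetical. A linear projection of this kind does not remove batch effects by itself; the sketch only shows the pattern of fitting an encoder once on reference data and then embedding new batches with the frozen parameters.

```python
import numpy as np

def fit_encoder(reference, n_components=10):
    """Fit a linear encoder on a (cells x genes) reference matrix.

    Stand-in for model training: PCA computed through an SVD.
    """
    mean = reference.mean(axis=0)
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    return mean, vt[:n_components].T  # shapes: (genes,), (genes, n_components)

def project(cells, encoder):
    """Embed cells with the frozen encoder; no parameters are updated."""
    mean, weights = encoder
    return (cells - mean) @ weights

rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 50))   # assumed reference data: 200 cells, 50 genes
new_batch = rng.normal(size=(30, 50))    # assumed later batch with the same genes

encoder = fit_encoder(reference, n_components=10)
frozen_weights = encoder[1].copy()       # snapshot to show the encoder stays fixed

ref_embedding = project(reference, encoder)
new_embedding = project(new_batch, encoder)  # online step: projection only
```

Because the new batch passes only through `project`, both embeddings live in the same coordinate system, which is what would let an atlas grow as data arrive without a refit.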