Zhang Zhaojun, Mathew Divij, Lim Tristan, Mason Kaishu, Martinez Clara Morral, Huang Sijia, Wherry E John, Susztak Katalin, Minn Andy J, Ma Zongming, Zhang Nancy R
Department of Statistics and Data Science, The Wharton School, University of Pennsylvania, PA, United States.
Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, PA, United States.
bioRxiv. 2023 Sep 23:2023.05.05.539614. doi: 10.1101/2023.05.05.539614.
Data integration to align cells across batches has become a cornerstone of single-cell data analysis, critically affecting downstream results. Yet how much biological signal is erased during integration? Currently, there are no guidelines for when the biological differences between samples are separable from batch effects, and thus data integration usually involves a lot of guesswork: cells across batches should be aligned to be "appropriately" mixed, while preserving "main cell type clusters". We show evidence that current paradigms for single-cell data integration are unnecessarily aggressive, removing biologically meaningful variation. To remedy this, we present a novel statistical model and computationally scalable algorithm, CellANOVA, to recover biological signal that is lost during single-cell data integration. CellANOVA utilizes a "pool-of-controls" design concept, applicable across diverse settings, to separate unwanted variation from biological variation of interest. When applied with existing integration methods, CellANOVA allows the recovery of subtle biological signals and corrects, to a large extent, the data distortion introduced by integration. Further, CellANOVA explicitly estimates cell- and gene-specific batch effect terms, which can be used to identify the cell types and pathways exhibiting the largest batch variations, providing clarity as to which biological signals can be recovered. These concepts are illustrated on studies of diverse designs, where the biological signals recovered by CellANOVA are validated by orthogonal assays. In particular, we show that CellANOVA is effective in the challenging case of single-cell and single-nuclei data integration, where the recovered biological signals are replicated in an independent study.
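The intuition behind a "pool-of-controls" design can be illustrated with a deliberately small toy example. This is not the CellANOVA model or its API: the data, the function names, and the one-gene additive-offset setup are all hypothetical, and the reading that unwanted variation is estimated from control samples only is an assumption made for illustration. The sketch contrasts an aggressive correction (centering every sample) with a correction whose batch offsets are estimated from controls alone.

```python
from statistics import mean

# Hypothetical toy data: one gene, two batches, each with control and treated cells.
# Batch B carries an additive offset of 5.0; treatment adds 2.0 in both batches.
cells = [
    # (batch, condition, expression)
    ("A", "control", 1.0), ("A", "control", 3.0),
    ("A", "treated", 3.0), ("A", "treated", 5.0),
    ("B", "control", 6.0), ("B", "control", 8.0),
    ("B", "treated", 8.0), ("B", "treated", 10.0),
]


def effect(rows):
    """Mean treated expression minus mean control expression."""
    treated = mean(x for _, c, x in rows if c == "treated")
    control = mean(x for _, c, x in rows if c == "control")
    return treated - control


def control_batch_gap(rows):
    """Difference between the control means of batch B and batch A."""
    ctrl_b = mean(x for b, c, x in rows if b == "B" and c == "control")
    ctrl_a = mean(x for b, c, x in rows if b == "A" and c == "control")
    return ctrl_b - ctrl_a


def center_every_sample(rows):
    """Aggressive correction: shift every (batch, condition) sample to the grand mean."""
    grand = mean(x for _, _, x in rows)
    out = []
    for b, c, x in rows:
        sample_mean = mean(v for bb, cc, v in rows if (bb, cc) == (b, c))
        out.append((b, c, x - sample_mean + grand))
    return out


def correct_with_controls(rows):
    """Estimate each batch offset from control cells only, then subtract it from all cells."""
    pooled_ctrl = mean(x for _, c, x in rows if c == "control")
    out = []
    for b, c, x in rows:
        batch_ctrl = mean(v for bb, cc, v in rows if bb == b and cc == "control")
        out.append((b, c, x - (batch_ctrl - pooled_ctrl)))
    return out


aggressive = center_every_sample(cells)
controlled = correct_with_controls(cells)

print("treatment effect, aggressive centering:", effect(aggressive))
print("treatment effect, control-based correction:", effect(controlled))
print("control batch gap before:", control_batch_gap(cells))
print("control batch gap after control-based correction:", control_batch_gap(controlled))
```

In this toy setting, centering every sample removes the treated-versus-control difference along with the batch offset, whereas estimating the offset from controls removes the batch gap and leaves the treatment difference intact. The real method operates on cell- and gene-specific terms rather than a single additive offset.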