Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA.
Nat Commun. 2023 Jul 12;14(1):4126. doi: 10.1038/s41467-023-39494-2.
Cell state atlases constructed through single-cell RNA-seq and ATAC-seq analysis are powerful tools for analyzing the effects of genetic and drug treatment-induced perturbations on complex cell systems. Comparative analysis of such atlases can yield new insights into cell state and trajectory alterations. Perturbation experiments often require that single-cell assays be carried out in multiple batches, which can introduce technical distortions that confound the comparison of biological quantities between different batches. Here we propose CODAL, a variational autoencoder-based statistical model which uses a mutual information regularization technique to explicitly disentangle factors related to technical and biological effects. We demonstrate CODAL's capacity for batch-confounded cell type discovery when applied to simulated datasets and embryonic development atlases with gene knockouts. CODAL improves the representation of RNA-seq and ATAC-seq modalities, yields interpretable modules of biological variation, and enables the generalization of other count-based generative models to multi-batched data.
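To make the idea of penalizing dependence between batch and biological factors concrete, the following minimal sketch estimates the mutual information between a batch label and a single latent factor. It is not CODAL's implementation, and the abstract does not specify which estimator or network architecture the model uses. The helper names `mutual_information` and `discretize`, the simulated latent factors, the batch shift of 3.0, and the choice of 10 bins are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats for two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written in terms of counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def discretize(values, n_bins=10):
    """Map continuous values to equal-width bin indices (hypothetical helper)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


rng = random.Random(0)

# Two simulated batches of 1000 cells each (assumed toy data, not from the paper).
batch = [i % 2 for i in range(2000)]

# A latent factor whose mean shifts with batch: technically confounded.
confounded = [rng.gauss(3.0 * b, 1.0) for b in batch]

# A latent factor drawn independently of batch: what disentangling aims for.
clean = [rng.gauss(0.0, 1.0) for _ in batch]

mi_confounded = mutual_information(batch, discretize(confounded))
mi_clean = mutual_information(batch, discretize(clean))

print(f"MI(batch; confounded latent) = {mi_confounded:.3f} nats")
print(f"MI(batch; clean latent)      = {mi_clean:.3f} nats")
```

In this toy setting the confounded factor carries substantial information about batch membership, while the independent factor carries almost none. A regularizer of the kind the abstract describes would push the biological factors of a model toward the second case.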