MOE Key Lab of Bioinformatics and Bioinformatics Division of BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China.
School of Life Sciences and School of Medicine, Center for Synthetic and Systems Biology, Tsinghua University, Beijing 100084, China.
Bioinformatics. 2024 Sep 2;40(9). doi: 10.1093/bioinformatics/btae518.
Single-cell RNA sequencing (scRNA-seq) data are important for studying the principles of life at the single-cell level. However, obtaining enough high-quality scRNA-seq data remains challenging. To mitigate this limited availability, generative models have been proposed to computationally generate synthetic scRNA-seq data. Nevertheless, the data generated by current models are still not sufficiently realistic, especially when data must be generated under controlled conditions. Meanwhile, diffusion models have proven able to generate high-fidelity data, providing a new opportunity for scRNA-seq generation.
In this study, we developed scDiffusion, a generative model that combines a diffusion model with a foundation model to generate high-quality scRNA-seq data under controlled conditions. We designed multiple classifiers that guide the diffusion process simultaneously, enabling scDiffusion to generate data under combinations of multiple conditions. We also proposed a new control strategy called Gradient Interpolation, which allows the model to generate continuous trajectories of cell development from a given cell state. Experiments showed that scDiffusion can generate single-cell gene expression data that closely resemble real scRNA-seq data. scDiffusion can also conditionally generate data for specific cell types, including rare cell types. Furthermore, using the multiple-condition generation of scDiffusion, we could generate cell types that were absent from the training data. Leveraging the Gradient Interpolation strategy, we generated a continuous developmental trajectory of mouse embryonic cells. These experiments demonstrate that scDiffusion is a powerful tool for augmenting real scRNA-seq data and can provide insights into cell fate research.
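The two control mechanisms described above, simultaneous guidance by several classifiers and Gradient Interpolation, can be illustrated with a minimal sketch. It is not the scDiffusion implementation: it assumes the standard classifier-guidance rule (shift the mean of a denoising step by the variance times the gradient of log p(y | x)), uses toy linear softmax classifiers in place of trained networks, and assumes that Gradient Interpolation blends the guidance gradients of two target states with a weight `w`. All function names, the second condition, and the numeric values are hypothetical.

```python
import numpy as np

def log_prob_grad(W, b, x, y):
    """Gradient of log p(y | x) w.r.t. x for a toy linear softmax classifier."""
    logits = W @ x + b
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return W[y] - probs @ W

def guided_mean(mean, variance, grads, scales):
    """Shift a denoising-step mean by the summed, scaled classifier gradients.

    Assumption: each classifier contributes additively, as in standard
    classifier guidance; the paper's exact rule may differ.
    """
    shift = sum(s * g for s, g in zip(scales, grads))
    return mean + variance * shift

def interpolated_grad(grad_start, grad_end, w):
    """Assumed form of Gradient Interpolation: a convex blend of two gradients."""
    return (1.0 - w) * grad_start + w * grad_end

rng = np.random.default_rng(0)
n_genes = 5                      # toy dimensionality, not a real gene count
x = rng.normal(size=n_genes)     # current noisy expression vector

# Two hypothetical classifiers: one for cell type, one for a second condition.
W_type, b_type = rng.normal(size=(3, n_genes)), np.zeros(3)
W_cond, b_cond = rng.normal(size=(2, n_genes)), np.zeros(2)

g_type = log_prob_grad(W_type, b_type, x, y=1)
g_cond = log_prob_grad(W_cond, b_cond, x, y=0)

# Multi-condition guidance: both classifiers steer the same step.
mu = guided_mean(mean=x, variance=0.1, grads=[g_type, g_cond], scales=[2.0, 2.0])

# Trajectory-style guidance: blend gradients toward two cell states.
g_mid = interpolated_grad(
    log_prob_grad(W_type, b_type, x, y=0),
    log_prob_grad(W_type, b_type, x, y=2),
    w=0.5,
)
```

In this sketch, sweeping `w` from 0 to 1 moves the guidance signal gradually from the first state to the second, which is one plausible way to obtain intermediate states along a trajectory.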
scDiffusion is openly available on GitHub at https://github.com/EperLuo/scDiffusion and on Zenodo at https://zenodo.org/doi/10.5281/zenodo.13268742.