Carrillo-Perez Francisco, Pizurica Marija, Zheng Yuanning, Nandi Tarak Nath, Madduri Ravi, Shen Jeanne, Gevaert Olivier
bioRxiv. 2023 Jul 10:2023.01.13.523899. doi: 10.1101/2023.01.13.523899.
Data scarcity presents a significant obstacle in the field of biomedicine, where acquiring diverse and sufficient datasets can be costly and challenging. Synthetic data generation offers a potential solution to this problem by expanding dataset sizes, thereby enabling the training of more robust and generalizable machine learning models. Although previous studies have explored synthetic data generation for cancer diagnosis, they have predominantly focused on single-modality settings, such as whole-slide image tiles or RNA-Seq data. To bridge this gap, we propose a novel approach, RNA-Cascaded-Diffusion-Model (RNA-CDM), for performing RNA-to-image synthesis in a multi-cancer context, drawing inspiration from successful text-to-image synthesis models used in natural images. In our approach, we employ a variational auto-encoder to reduce the dimensionality of a patient's gene expression profile, effectively distinguishing between different types of cancer. Subsequently, we employ a cascaded diffusion model to synthesize realistic whole-slide image tiles using the latent representation derived from the patient's RNA-Seq data. Our results demonstrate that the generated tiles accurately preserve the distribution of cell types observed in real-world data, with state-of-the-art cell identification models successfully detecting important cell types in the synthetic samples. Furthermore, we illustrate that the synthetic tiles maintain the cell fraction observed in bulk RNA-Seq data and that modifications in gene expression affect the composition of cell types in the synthetic tiles. Next, we utilize the synthetic data generated by RNA-CDM to pretrain machine learning models and observe improved performance compared to training from scratch.
Our study emphasizes the potential usefulness of synthetic data in developing machine learning models in data-scarce settings, while also highlighting the possibility of imputing missing data modalities by leveraging the available information. In conclusion, our proposed RNA-CDM approach for synthetic data generation in biomedicine, particularly in the context of cancer diagnosis, offers a novel and promising solution to address data scarcity. By generating synthetic data that aligns with real-world distributions and leveraging it to pretrain machine learning models, we contribute to the development of robust clinical decision support systems and potential advancements in precision medicine.
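
The two building blocks named in the abstract, a variational auto-encoder latent and a diffusion model, can be illustrated with their textbook formulas. The sketch below is not the RNA-CDM implementation: the abstract gives no architecture details, so it assumes a Gaussian latent with the standard reparameterization trick and a standard denoising-diffusion forward (noising) step. All names and toy values (`mu`, `logvar`, `pixels`, `alpha_bar_t`) are hypothetical.

```python
import math
import random

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))

def forward_noise(x0, alpha_bar_t, rng):
    """Closed-form forward diffusion: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * rng.gauss(0.0, 1.0) for x in x0]

rng = random.Random(0)

# Hypothetical encoder outputs for one patient's gene expression profile.
mu = [0.0, 0.5, -1.0]
logvar = [0.0, -2.0, -2.0]

z = reparameterize(mu, logvar, rng)        # latent that would condition the image model
kl = kl_to_standard_normal(mu, logvar)     # regularization term of the VAE objective

# Toy stand-in for a normalized image tile, noised halfway through the schedule.
pixels = [0.2, -0.4, 0.9, 0.0]
x_t = forward_noise(pixels, alpha_bar_t=0.5, rng=rng)
```

In a conditional setup of this kind, a denoising network would receive `x_t`, the timestep, and the latent `z`, and be trained to predict the added noise; a cascaded model would repeat this at increasing image resolutions.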