Carrillo-Perez Francisco, Pizurica Marija, Zheng Yuanning, Nandi Tarak Nath, Madduri Ravi, Shen Jeanne, Gevaert Olivier
Stanford Center for Biomedical Informatics Research (BMIR), Stanford University, School of Medicine, Stanford, CA, USA.
Internet technology and Data science Lab (IDLab), Ghent University, Ghent, Belgium.
Nat Biomed Eng. 2025 Mar;9(3):320-332. doi: 10.1038/s41551-024-01193-8. Epub 2024 Mar 21.
Training machine-learning models with synthetically generated data can alleviate the problem of data scarcity when acquiring diverse and sufficiently large datasets is costly and challenging. Here we show that cascaded diffusion models can be used to synthesize realistic whole-slide image tiles from latent representations of RNA-sequencing data from human tumours. Alterations in gene expression affected the composition of cell types in the generated synthetic image tiles, which accurately preserved the distribution of cell types and maintained the cell fraction observed in bulk RNA-sequencing data, as we show for lung adenocarcinoma, kidney renal papillary cell carcinoma, cervical squamous cell carcinoma, colon adenocarcinoma and glioblastoma. Machine-learning models pretrained with the generated synthetic data performed better than models trained from scratch. Synthetic data may accelerate the development of machine-learning models in scarce-data settings and allow for the imputation of missing data modalities.