Panaccione Francesca Pia, Mongardi Sofia, Masseroli Marco, Pinoli Pietro
Department of Electronics, Information, and Bioengineering, Politecnico di Milano, 20133 Milan, Italy.
Bioengineering (Basel). 2025 Jun 16;12(6):658. doi: 10.3390/bioengineering12060658.
The advancement of computational genomics has significantly enhanced the use of data-driven solutions in disease prediction and precision medicine. Yet challenges such as data scarcity, privacy constraints, and biases persist. Synthetic data generation offers a promising solution to these issues. However, existing approaches based on generative artificial intelligence often fail to incorporate biological knowledge, limiting the realism and utility of generated samples. In this work, we present BioGAN, a novel generative framework that, for the first time, incorporates graph neural networks into a generative adversarial network architecture for transcriptomic data generation. By leveraging gene regulatory and co-expression networks, our model preserves biological properties in the generated transcriptomic profiles. We validate its effectiveness on [...] and human gene expression datasets through extensive experiments using unsupervised and supervised evaluation metrics. The results demonstrate that incorporating a priori biological knowledge is an effective strategy for enhancing both the quality and utility of synthetic transcriptomic data. On human data, BioGAN achieves a 4.3% improvement in precision and up to 2.6% higher correlation with real profiles compared to state-of-the-art models. In downstream disease and tissue classification tasks, our synthetic data improves prediction performance by an average of 5.7%. Results on [...] further confirm BioGAN's robustness, showing consistently strong recall and predictive utility.
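To illustrate what feeding a gene network into a generator can look like, here is a minimal sketch of one generic graph-convolution propagation step, ReLU(Â H W), over a toy gene graph. It is not BioGAN's actual architecture, which the abstract does not specify. The four-gene edge list, the layer sizes, and the helper names `normalized_adjacency`, `matmul`, and `graph_conv` are all hypothetical and chosen only for illustration.

```python
import math
import random


def normalized_adjacency(edges, n):
    """Return D^-1/2 (A + I) D^-1/2 as a dense n x n list of lists."""
    # Start from the identity so every gene keeps a self-loop.
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        a[i][j] = 1.0
        a[j][i] = 1.0  # treat the network as undirected (an assumption)
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(x, y):
    """Plain dense matrix product on lists of lists."""
    return [[sum(xi * yj for xi, yj in zip(row, col)) for col in zip(*y)]
            for row in x]


def graph_conv(a_hat, h, w):
    """One propagation step: ReLU(A_hat @ H @ W)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]


# Hypothetical toy network: 4 genes connected in a chain.
edges = [(0, 1), (1, 2), (2, 3)]
n_genes = 4
a_hat = normalized_adjacency(edges, n_genes)

rng = random.Random(0)
# One 3-dimensional latent noise vector per gene (illustrative sizes).
noise = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(n_genes)]
# Randomly initialised 3 x 2 weight matrix standing in for learned weights.
weights = [[rng.gauss(0, 0.5) for _ in range(2)] for _ in range(3)]

# Each gene's hidden features now mix information from its neighbours.
hidden = graph_conv(a_hat, noise, weights)
```

In a full model, several such layers would be stacked and trained adversarially against a discriminator. This sketch covers only how graph structure mixes per-gene features.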