Department of Computer Science and Technology, University of Cambridge, Cambridge, UK.
Department of Computer Science, University College London, London, UK.
Bioinformatics. 2022 Jan 12;38(3):730-737. doi: 10.1093/bioinformatics/btab035.
High-throughput gene expression data can be used to address a wide range of fundamental biological problems, but datasets of an appropriate size are often unavailable. Moreover, existing transcriptomics simulators have been criticized because they fail to emulate key properties of gene expression data. In this article, we develop a method based on a conditional generative adversarial network to generate realistic transcriptomics data for Escherichia coli and humans. We assess the performance of our approach across several tissues and cancer types.
We show that our model preserves several gene expression properties significantly better than widely used simulators such as SynTReN and GeneNetWeaver. The synthetic data preserve tissue- and cancer-specific properties of transcriptomics data. Moreover, they exhibit real gene clusters and ontologies at both local and global scales, suggesting that the model learns to approximate the gene expression manifold in a biologically meaningful way.
Code is available at: https://github.com/rvinas/adversarial-gene-expression.
Supplementary data are available at Bioinformatics online.
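The claim that synthetic data preserve gene expression properties can be made concrete by comparing pairwise gene-gene correlations in real and generated samples. The following is a minimal sketch of one such check and is not necessarily a metric used in the article. The function names (`pearson`, `gene_correlations`, `correlation_structure_score`, `toy_samples`) and the four-gene toy data are hypothetical and chosen for illustration only.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def gene_correlations(samples):
    """Upper triangle of the gene-gene correlation matrix, flattened to a list.

    `samples` is a list of rows, each row holding one sample's expression values.
    """
    genes = list(zip(*samples))  # transpose: one tuple of values per gene
    return [
        pearson(genes[i], genes[j])
        for i in range(len(genes))
        for j in range(i + 1, len(genes))
    ]


def correlation_structure_score(real, synthetic):
    """Correlation between the real and synthetic gene-gene correlation profiles.

    Values near 1 suggest the synthetic data reproduce the pairwise structure.
    """
    return pearson(gene_correlations(real), gene_correlations(synthetic))


def toy_samples(rng, n, flip=False):
    """Hypothetical toy data with 4 genes (an assumption, not real expression data).

    Gene 1 tracks gene 0 and gene 3 opposes gene 2; `flip=True` reverses both signs.
    """
    sign = -1.0 if flip else 1.0
    rows = []
    for _ in range(n):
        g0 = rng.gauss(0, 1)
        g2 = rng.gauss(0, 1)
        g1 = sign * g0 + rng.gauss(0, 0.3)
        g3 = -sign * g2 + rng.gauss(0, 0.3)
        rows.append([g0, g1, g2, g3])
    return rows


if __name__ == "__main__":
    rng = random.Random(0)
    real = toy_samples(rng, 300)
    faithful = toy_samples(rng, 300)            # same dependency structure
    distorted = toy_samples(rng, 300, flip=True)  # reversed dependencies
    print("faithful: ", correlation_structure_score(real, faithful))
    print("distorted:", correlation_structure_score(real, distorted))
```

In this toy setting, a generator that reproduces the dependency structure scores close to 1, while one that reverses the gene relationships scores close to -1. With real data, the rows would be measured and generated expression profiles over the same set of genes.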