Institute of Genomics, University of Tartu, Tartu, Estonia.
Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia.
PLoS Genet. 2021 Feb 4;17(2):e1009303. doi: 10.1371/journal.pgen.1009303. eCollection 2021 Feb.
Generative models have achieved breakthroughs across a wide range of domains thanks to recent advances in machine learning algorithms and increased computational power. Despite these achievements, their ability to create realistic synthetic data remains under-exploited in genetics and is absent from population genetics. Yet a known limitation in the field is restricted access to many genetic databases because of concerns about violations of individual privacy, even though these databases would provide a rich resource for data mining and integration towards advancing genetic studies. In this study, we demonstrate that deep generative adversarial networks (GANs) and restricted Boltzmann machines (RBMs) can be trained to learn the complex distributions of real genomic datasets and generate novel, high-quality artificial genomes (AGs) with little to no privacy loss. We show that the generated AGs replicate characteristics of the source dataset such as allele frequencies, linkage disequilibrium, pairwise haplotype distances and population structure. Moreover, they can also inherit complex features such as signals of selection. To illustrate the promise of our method, we show that imputation quality for low-frequency alleles can be improved by augmenting reference panels with AGs, and that the RBM latent space provides a relevant encoding of the data, hence allowing further exploration of the reference dataset and features for solving supervised tasks. Generative models and AGs have the potential to become valuable assets in genetic studies by providing a rich yet compact representation of existing genomes and high-quality, easily accessible and anonymous alternatives to private databases.