Interpreting Generative Adversarial Networks to Infer Natural Selection from Genetic Data

Authors

Riley Rebecca, Mathieson Iain, Mathieson Sara

Affiliations

Department of Computer Science, Haverford College, Haverford PA, 19041 USA.

Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia PA, 19104 USA.

Publication information

bioRxiv. 2023 Jul 9:2023.03.07.531546. doi: 10.1101/2023.03.07.531546.

Abstract

Understanding natural selection in humans and other species is a major focus for the use of machine learning in population genetics. Existing methods rely on computationally intensive simulated training data. Unlike efficient neutral coalescent simulations for demographic inference, realistic simulations of selection typically require slow forward simulations. Because there are many possible modes of selection, a high-dimensional parameter space must be explored, with no guarantee that the simulated models are close to the real processes. Mismatches between simulated training data and real test data can lead to incorrect inference. Finally, it is difficult to interpret trained neural networks, leading to a lack of understanding about which features contribute to classification. Here we develop a new approach to detect selection that requires relatively few selection simulations during training. We use a Generative Adversarial Network (GAN) trained to simulate realistic neutral data. The resulting GAN consists of a generator (fitted demographic model) and a discriminator (convolutional neural network). For a genomic region, the discriminator predicts whether it is "real" or "fake" in the sense that it could have been simulated by the generator. As the "real" training data includes regions that experienced selection and the generator cannot produce such regions, regions with a high probability of being real are likely to have experienced selection. To further incentivize this behavior, we "fine-tune" the discriminator with a small number of selection simulations. We show that this approach has high power to detect selection in simulations, and that it finds regions under selection identified by state-of-the-art population genetic methods in three human populations. Finally, we show how to interpret the trained networks by clustering hidden units of the discriminator based on their correlation patterns with known summary statistics. In summary, our approach is a novel, efficient, and powerful way to use machine learning to detect natural selection.
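
The interpretation step described in the abstract (relating discriminator hidden units to known summary statistics) can be illustrated with a small sketch. This is not the authors' code: the unit names, the statistic names `stat_a` and `stat_b`, and the toy values are all hypothetical, and the grouping rule used here (assign each unit to the statistic with which it has the largest absolute Pearson correlation) is a simple stand-in for the clustering the abstract mentions.

```python
import math
from collections import defaultdict


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_profiles(hidden, stats):
    """Map each hidden unit to its correlation with every summary statistic.

    hidden: {unit_name: [activation per region]}
    stats:  {stat_name: [statistic value per region]}
    """
    return {
        unit: {name: pearson(acts, values) for name, values in stats.items()}
        for unit, acts in hidden.items()
    }


def group_units(profiles):
    """Group units by the statistic they correlate with most strongly.

    Assumption: this replaces a real clustering of the correlation
    profiles with a much simpler best-match assignment.
    """
    groups = defaultdict(list)
    for unit, profile in profiles.items():
        best = max(profile, key=lambda name: abs(profile[name]))
        groups[best].append(unit)
    return dict(groups)


# Toy data for 8 genomic regions (hypothetical values, not real measurements).
stats = {
    "stat_a": [0, 1, 2, 3, 4, 5, 6, 7],
    "stat_b": [1, -1, 1, -1, 1, -1, 1, -1],
}
hidden = {
    "unit_0": [2 * a + 1 for a in stats["stat_a"]],   # tracks stat_a
    "unit_1": [-b for b in stats["stat_b"]],          # tracks stat_b (negatively)
    "unit_2": [a + 0.5 * b for a, b in zip(stats["stat_a"], stats["stat_b"])],
}

profiles = correlation_profiles(hidden, stats)
groups = group_units(profiles)
print(groups)
```

In practice the activations would come from a trained discriminator evaluated on many regions, and the correlation profiles would more likely be passed to a proper clustering method than to this best-match rule.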

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fea1/10331864/23c9fc903a88/nihpp-2023.03.07.531546v2-f0002.jpg
