Professorship for Population Genetics, Department of Life Science Systems, Technical University of Munich, Germany.
Centre for Biological Diversity, Sir Harold Mitchell Building, University of St Andrews, Fife KY16 9TF, UK.
Genome Biol Evol. 2023 Feb 3;15(2). doi: 10.1093/gbe/evad008.
Population genetics is transitioning into a data-driven discipline thanks to the availability of large-scale genomic data and the need to study increasingly complex evolutionary scenarios. With likelihood and Bayesian approaches becoming either intractable or computationally unfeasible, machine learning algorithms, and deep learning in particular, are emerging as popular techniques for population genetic inferences. These approaches rely on algorithms that learn non-linear relationships between the input data and the model parameters being estimated through representation learning from training data sets. Deep learning algorithms currently employed in the field comprise discriminative and generative models with fully connected, convolutional, or recurrent layers. Additionally, a wide range of powerful simulators to generate training data under complex scenarios are now available. The application of deep learning to empirical data sets mostly replicates previous findings of demography reconstruction and signals of natural selection in model organisms. To showcase the feasibility of deep learning to tackle new challenges, we designed a branched architecture to detect signals of recent balancing selection from temporal haplotypic data, which exhibited good predictive performance on simulated data. Investigations into the interpretability of neural networks, their robustness to uncertain training data, and creative representation of population genetic data will provide further opportunities for technological advancements in the field.
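To make the idea of a branched architecture for temporal haplotypic data more concrete, the following is a minimal sketch of a forward pass only, written with NumPy. It is not the network used in the article, whose details are not given in this abstract. It assumes one convolutional branch per sampling time point, each taking a binary haplotype matrix (haplotypes by sites), with the branch outputs concatenated and passed through a small fully connected head. All names (`conv1d_branch`, `branched_forward`, `init_params`), the number of time points, and the layer sizes are hypothetical, and the weights are random and untrained, so the output probability carries no biological meaning.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def conv1d_branch(haps, kernels, bias):
    """One branch: 1D convolution along sites, ReLU, then global average.

    haps    : (n_haplotypes, n_sites) matrix of 0/1 alleles (assumed encoding)
    kernels : (n_filters, width) filter weights
    bias    : (n_filters,) filter biases
    Returns a (n_filters,) feature vector. Averaging over haplotypes makes
    the result independent of the row order of the haplotype matrix.
    """
    width = kernels.shape[1]
    # windows has shape (n_haplotypes, n_sites - width + 1, width)
    windows = np.lib.stride_tricks.sliding_window_view(haps, width, axis=1)
    feature_maps = relu(windows @ kernels.T + bias)
    return feature_maps.mean(axis=(0, 1))


def init_params(n_branches, n_filters=8, width=5, n_hidden=16, seed=0):
    """Random, untrained parameters (illustrative sizes only)."""
    rng = np.random.default_rng(seed)
    branches = [
        {"kernels": rng.normal(0, 0.5, (n_filters, width)),
         "bias": np.zeros(n_filters)}
        for _ in range(n_branches)
    ]
    return {
        "branches": branches,
        "W_hidden": rng.normal(0, 0.5, (n_hidden, n_branches * n_filters)),
        "b_hidden": np.zeros(n_hidden),
        "w_out": rng.normal(0, 0.5, n_hidden),
        "b_out": 0.0,
    }


def branched_forward(haps_by_time, params):
    """Run each time point through its own branch, merge, and classify."""
    features = [
        conv1d_branch(h, p["kernels"], p["bias"])
        for h, p in zip(haps_by_time, params["branches"])
    ]
    merged = np.concatenate(features)
    hidden = relu(params["W_hidden"] @ merged + params["b_hidden"])
    logit = params["w_out"] @ hidden + params["b_out"]
    # Sigmoid output, read as P(balancing selection) once the net is trained
    return 1.0 / (1.0 + np.exp(-logit))


# Toy input: two sampling time points, 20 haplotypes x 100 sites each.
rng = np.random.default_rng(1)
samples = [rng.integers(0, 2, size=(20, 100)) for _ in range(2)]
params = init_params(n_branches=2)
prob = branched_forward(samples, params)
print(f"untrained output probability: {prob:.3f}")
```

In practice such a network would be trained on simulated data labelled by scenario (for example, neutral versus balancing selection), and a deep learning framework would replace the hand-written forward pass.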