Quelin Arnaud, Austerlitz Frédéric, Jay Flora
UMR 7206 Eco-Anthropologie (EA), CNRS, Muséum National d'Histoire Naturelle, Université Paris Cité, Paris, France.
UMR 9015 - Laboratoire Interdisciplinaire des Sciences du Numérique (LISN), CNRS, INRIA, Université Paris-Saclay, Orsay, France.
Heredity (Edinb). 2025 Jun 6. doi: 10.1038/s41437-025-00773-x.
The ever-increasing availability of high-throughput DNA sequences and the development of numerous computational methods have led to considerable advances in our understanding of the evolutionary and demographic history of populations. Several demographic inference methods have been developed to take advantage of these massive genomic data. Simulation-based approaches, such as approximate Bayesian computation (ABC), have proved particularly efficient for complex demographic models. However, demographic inference methods generally rely on only partial information from these data, so taking full advantage of the comprehensive information they contain remains a challenge. Advanced computational methods, such as machine learning, are valuable for efficiently integrating more comprehensive information. Here, we showed that simulation-based supervised machine learning methods applied to an extensive range of summary statistics are effective in inferring demographic parameters for connected populations. We compared three machine learning (ML) methods, one neural network (the multilayer perceptron, MLP) and two ensemble methods (random forest, RF, and the gradient boosting system XGBoost, XGB), to infer demographic parameters from genomic data under a standard isolation with migration model and a secondary contact model with varying population sizes. We showed that MLP outperformed the other two methods and that, on the basis of permutation feature importance, its predictions involved a larger combination of summary statistics. Moreover, the ML methods outperformed all three tested ABC algorithms. Finally, we demonstrated how a method called SHAP, from the field of explainable artificial intelligence, can be used to shed light on the contribution of summary statistics within the ML models.
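The general workflow described in the abstract (simulate data under known parameters, reduce them to summary statistics, train supervised regressors, then inspect permutation feature importance) can be illustrated with a minimal sketch. This is not the authors' pipeline: `toy_simulator` is a hypothetical stand-in for a real coalescent simulator, its three "summary statistics" and the parameter ranges are invented for illustration, and only scikit-learn's MLP and random forest are used (XGBoost, ABC, and SHAP are left out to keep the example self-contained).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)


def toy_simulator(n_sims):
    """Hypothetical stand-in for a genomic simulator (assumption, not from the paper).

    Draws two demographic parameters from uniform priors and returns
    three noisy toy summary statistics for each simulated data set.
    """
    log_ne = rng.uniform(3.0, 5.0, n_sims)    # log10 effective population size
    log_m = rng.uniform(-4.0, -1.0, n_sims)   # log10 migration rate
    params = np.column_stack([log_ne, log_m])
    stats = np.column_stack([
        log_ne + rng.normal(0.0, 0.1, n_sims),          # diversity-like statistic
        log_ne + log_m + rng.normal(0.0, 0.1, n_sims),  # differentiation-like statistic
        rng.normal(0.0, 0.1, n_sims),                   # uninformative statistic
    ])
    return stats, params


# Build a labelled training set purely from simulations.
X, y = toy_simulator(2000)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    # Inputs are standardized before the MLP, which is sensitive to feature scale.
    "MLP": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0),
    ),
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
}

scores = {}
importances = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # R^2 on held-out simulations, averaged over the two parameters.
    scores[name] = model.score(X_test, y_test)
    # Permutation importance: drop in score when one statistic is shuffled.
    result = permutation_importance(
        model, X_test, y_test, n_repeats=5, random_state=0
    )
    importances[name] = result.importances_mean

for name in models:
    print(name, "R^2:", round(scores[name], 3), "importances:", importances[name])
```

In this toy setting the uninformative third statistic should receive a permutation importance close to zero for both models, while the two informative statistics should not. With a real simulator, the same loop would simply run over a much wider set of summary statistics.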