Assessing simulation-based supervised machine learning for demographic parameter inference from genomic data.

Author information

Quelin Arnaud, Austerlitz Frédéric, Jay Flora

Affiliations

UMR 7206 Eco-Anthropologie (EA), CNRS, Muséum National d'Histoire Naturelle, Université Paris Cité, Paris, France.

UMR 9015 - Laboratoire Interdisciplinaire des Sciences du Numérique (LISN), CNRS, INRIA, Université Paris-Saclay, Orsay, France.

Publication information

Heredity (Edinb). 2025 Jun 6. doi: 10.1038/s41437-025-00773-x.

DOI: 10.1038/s41437-025-00773-x

PMID: 40473775

Abstract

The ever-increasing availability of high-throughput DNA sequences and the development of numerous computational methods have led to considerable advances in our understanding of the evolutionary and demographic history of populations. Several demographic inference methods have been developed to take advantage of these massive genomic data. Simulation-based approaches, such as approximate Bayesian computation (ABC), have proved particularly efficient for complex demographic models. However, taking full advantage of the comprehensive information contained in massive genomic data remains a challenge for demographic inference methods, which generally rely on partial information from these data. Using advanced computational methods, such as machine learning, is valuable for efficiently integrating more comprehensive information. Here, we showed how simulation-based supervised machine learning methods applied to an extensive range of summary statistics are effective in inferring demographic parameters for connected populations. We compared three machine learning (ML) methods: a neural network, the multilayer perceptron (MLP), and two ensemble methods, random forest (RF) and the gradient boosting system XGBoost (XGB), to infer demographic parameters from genomic data under a standard isolation with migration model and a secondary contact model with varying population sizes. We showed that MLP outperformed the other two methods and that, on the basis of permutation feature importance, its predictions involved a larger combination of summary statistics. Moreover, they outperformed all three tested ABC algorithms. Finally, we demonstrated how a method called SHAP, from the field of explainable artificial intelligence, can be used to shed light on the contribution of summary statistics within the ML models.
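The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy simulator `simulate_toy_dataset`, its two informative "summary statistics", and all hyperparameters are hypothetical, and scikit-learn's `GradientBoostingRegressor` is used here as a stand-in for XGBoost. The sketch trains an MLP, a random forest, and a gradient boosting model on simulated (summary statistics, parameter) pairs, then reports test-set R² and permutation feature importance for each. SHAP is not shown.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def simulate_toy_dataset(n_samples=2000, n_stats=8, seed=0):
    """Hypothetical stand-in for a population-genetic simulator.

    Draws one parameter from a uniform prior and returns noisy
    'summary statistics'; only the first two columns carry signal.
    """
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 1.0, size=n_samples)    # parameter to infer
    stats = rng.normal(size=(n_samples, n_stats))    # pure noise by default
    stats[:, 0] += 5.0 * theta                       # linear signal
    stats[:, 1] += np.sin(2.0 * np.pi * theta)       # nonlinear signal
    return stats, theta


def compare_models(seed=0):
    """Fit three regressors and return {name: (R^2, mean permutation importances)}."""
    X, y = simulate_toy_dataset(seed=seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    models = {
        # The MLP is sensitive to feature scale, so inputs are standardized.
        "MLP": make_pipeline(
            StandardScaler(),
            MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=500,
                         random_state=seed),
        ),
        "RF": RandomForestRegressor(n_estimators=100, random_state=seed),
        # Assumption: generic gradient boosting in place of XGBoost.
        "GB": GradientBoostingRegressor(random_state=seed),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        r2 = r2_score(y_test, model.predict(X_test))
        # Permutation importance: drop in score when one feature is shuffled.
        pfi = permutation_importance(
            model, X_test, y_test, n_repeats=5, random_state=seed
        )
        results[name] = (r2, pfi.importances_mean)
    return results


if __name__ == "__main__":
    for name, (r2, importances) in compare_models().items():
        print(f"{name}: R2={r2:.3f}, importances={np.round(importances, 3)}")
```

Permutation importance is computed on held-out simulations, so it reflects how much each statistic contributes to predictive accuracy and not how often a model happens to use it during training.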
