Human genotype-to-phenotype predictions: Boosting accuracy with nonlinear models.

Affiliation

Skolkovo Institute of Science and Technology, Moscow, Russia.

Publication information

PLoS One. 2022 Aug 31;17(8):e0273293. doi: 10.1371/journal.pone.0273293. eCollection 2022.

DOI:10.1371/journal.pone.0273293

PMID:36044406

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9432766/

Abstract

Genotype-to-phenotype prediction is a central problem of human genetics. In recent years, large genome data sets and efficient, scalable machine learning tools have made it possible to construct complex predictive models for phenotypes. In this paper, we make a threefold contribution to this problem. First, we ask whether state-of-the-art nonlinear predictive models, such as boosted decision trees, can be more efficient for phenotype prediction than conventional linear models. We find that this is indeed the case if the model features include a sufficiently rich set of covariates, but probably not otherwise. Second, we ask whether the conventional selection of single nucleotide polymorphisms (SNPs) by genome-wide association studies (GWAS) can be replaced by a more efficient procedure that takes into account the information in previously selected SNPs. We propose such a procedure, based on sequential feature importance estimation with decision trees, and show that it indeed produces informative SNP sets that are much more compact than those selected with GWAS. Finally, we show that the highest prediction accuracy can ultimately be achieved by ensembling individual linear and nonlinear models. To the best of our knowledge, for some of the phenotypes that we consider (asthma, hypothyroidism), our results set a new state of the art.
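The contrast between GWAS-style selection and selection that accounts for previously chosen SNPs can be illustrated with a toy example. The sketch below is not the authors' procedure: it is a minimal pure-Python illustration under stated assumptions, in which each candidate SNP is scored by the squared-error reduction of a tiny single-SNP tree (one leaf per genotype 0/1/2), and the sequential variant re-scores candidates on the residuals left by the SNPs already selected. All function names (`tree_gain`, `rank_marginal`, `select_sequential`, `make_demo_data`) and the synthetic data are hypothetical.

```python
import random


def tree_gain(genotypes, residuals):
    """Squared-error reduction of a tiny tree with one leaf per genotype value."""
    sums, counts = {}, {}
    for g, r in zip(genotypes, residuals):
        sums[g] = sums.get(g, 0.0) + r
        counts[g] = counts.get(g, 0) + 1
    total = sum(sums.values())
    gain = sum(s * s / counts[g] for g, s in sums.items()) - total * total / len(residuals)
    leaf_means = {g: sums[g] / counts[g] for g in sums}
    return gain, leaf_means


def rank_marginal(X, y, k):
    """GWAS-like baseline: score every SNP independently against the phenotype."""
    mean = sum(y) / len(y)
    centered = [yi - mean for yi in y]
    gains = [tree_gain(col, centered)[0] for col in X]
    return sorted(range(len(X)), key=lambda j: -gains[j])[:k]


def select_sequential(X, y, k):
    """Greedy selection: each SNP is scored on what earlier picks left unexplained."""
    mean = sum(y) / len(y)
    resid = [yi - mean for yi in y]
    selected = []
    for _ in range(k):
        best_j, best_gain, best_leaves = None, 0.0, None
        for j, col in enumerate(X):
            if j in selected:
                continue
            gain, leaves = tree_gain(col, resid)
            if gain > best_gain:
                best_j, best_gain, best_leaves = j, gain, leaves
        if best_j is None:
            break
        # Subtract the fitted leaf means so correlated SNPs lose their appeal.
        resid = [r - best_leaves[g] for g, r in zip(X[best_j], resid)]
        selected.append(best_j)
    return selected


def make_demo_data(seed=0, n=600, n_snps=10):
    """Synthetic data (an assumption for illustration, not real genotypes)."""
    rng = random.Random(seed)
    draw = lambda: sum(rng.random() < 0.5 for _ in range(2))  # genotype 0/1/2
    snp0 = [draw() for _ in range(n)]
    # SNP 1 is a noisy copy of SNP 0, mimicking linkage disequilibrium.
    snp1 = [g if rng.random() > 0.1 else rng.choice([0, 1, 2]) for g in snp0]
    X = [snp0, snp1] + [[draw() for _ in range(n)] for _ in range(n_snps - 2)]
    # Phenotype depends on SNP 0 (strongly) and SNP 2 (weakly).
    y = [1.0 * X[0][i] + 0.5 * X[2][i] + rng.gauss(0, 0.3) for i in range(n)]
    return X, y


if __name__ == "__main__":
    X, y = make_demo_data()
    print("marginal top-2:  ", rank_marginal(X, y, 2))
    print("sequential top-2:", select_sequential(X, y, 2))
```

In this toy setting, the marginal ranking spends its second slot on the redundant copy of the first SNP, whereas the sequential variant moves on to the weaker but independent signal. This is one way a selected set can be more compact without losing information.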

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/49a2/9432766/3973f87a12ef/pone.0273293.g001.jpg
