Human genotype-to-phenotype predictions: Boosting accuracy with nonlinear models.

Affiliation

Skolkovo Institute of Science and Technology, Moscow, Russia.

Publication information

PLoS One. 2022 Aug 31;17(8):e0273293. doi: 10.1371/journal.pone.0273293. eCollection 2022.

Abstract

Genotype-to-phenotype prediction is a central problem of human genetics. In recent years, large genomic data sets and efficient, scalable machine learning tools have made it possible to construct complex predictive models for phenotypes. In this paper, we make a threefold contribution to this problem. First, we ask whether state-of-the-art nonlinear predictive models, such as boosted decision trees, can be more efficient for phenotype prediction than conventional linear models. We find that this is indeed the case if the model features include a sufficiently rich set of covariates, but probably not otherwise. Second, we ask whether the conventional selection of single nucleotide polymorphisms (SNPs) by genome-wide association studies (GWAS) can be replaced by a more efficient procedure that takes into account the information in previously selected SNPs. We propose such a procedure, based on sequential feature importance estimation with decision trees, and show that it indeed produces informative SNP sets that are much more compact than those selected with GWAS. Finally, we show that the highest prediction accuracy can ultimately be achieved by ensembling individual linear and nonlinear models. To the best of our knowledge, for some of the phenotypes that we consider (asthma, hypothyroidism), our results set a new state of the art.
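The difference between one-SNP-at-a-time ranking and a sequential selection that accounts for previously selected SNPs can be illustrated with a toy sketch. This is not the authors' procedure: it uses greedy boosting with depth-1 trees (stumps) as a simple stand-in for tree-based sequential importance estimation, and an absolute Pearson correlation as a stand-in for a GWAS association score. The data, the function names (`marginal_top_k`, `best_stump`, `sequential_top_k`) and the effect sizes are all hypothetical; the second SNP is an exact copy of the first, mimicking perfect linkage disequilibrium.

```python
from itertools import product
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5

def marginal_top_k(columns, y, k):
    """GWAS-like stand-in: score each SNP on its own, keep the k best."""
    scores = [abs(pearson(col, y)) for col in columns]
    return sorted(range(len(columns)), key=lambda j: -scores[j])[:k]

def best_stump(columns, residual):
    """Find the single split (SNP j, threshold t) with the largest variance reduction."""
    n = len(residual)
    best = (0.0, None, None, 0.0, 0.0)  # (gain, snp, threshold, left mean, right mean)
    for j, col in enumerate(columns):
        for t in (0, 1):  # genotypes are coded 0/1/2, so two possible cut points
            left = [r for g, r in zip(col, residual) if g <= t]
            right = [r for g, r in zip(col, residual) if g > t]
            if not left or not right:
                continue
            ml, mr = mean(left), mean(right)
            gain = len(left) * len(right) / n * (ml - mr) ** 2
            if gain > best[0]:  # strict ">" keeps the earlier SNP on exact ties
                best = (gain, j, t, ml, mr)
    return best

def sequential_top_k(columns, y, k, max_rounds=50):
    """Fit stumps to the current residual; record each SNP the first time it is used."""
    residual = list(y)
    selected = []
    for _ in range(max_rounds):
        gain, j, t, ml, mr = best_stump(columns, residual)
        if j is None:  # no split improves the fit
            break
        if j not in selected:
            selected.append(j)
        if len(selected) == k:
            break
        # Subtract the stump's prediction so later rounds only see unexplained signal.
        residual = [r - (ml if g <= t else mr) for g, r in zip(columns[j], residual)]
    return selected

# Hypothetical toy cohort: every combination of three independent genotypes.
rows = list(product((0, 1, 2), repeat=3))
snp_a = [r[0] for r in rows]  # strong causal SNP
snp_b = list(snp_a)           # exact copy of snp_a (perfect linkage disequilibrium)
snp_c = [r[1] for r in rows]  # weaker, independent causal SNP
snp_d = [r[2] for r in rows]  # unrelated SNP
columns = [snp_a, snp_b, snp_c, snp_d]
y = [3 * a + c for a, c in zip(snp_a, snp_c)]  # noise-free quantitative phenotype

print("marginal ranking:", marginal_top_k(columns, y, 2))
print("sequential selection:", sequential_top_k(columns, y, 2))
```

On this toy data the marginal ranking spends both slots on the duplicated pair, while the sequential variant picks one copy and then the weaker independent SNP, because the duplicate adds nothing once its twin's signal has been removed from the residual. Real genotype data are noisy and far larger, so this only conveys the intuition behind a more compact SNP set.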

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/49a2/9432766/3973f87a12ef/pone.0273293.g001.jpg
