A new method for exploring gene-gene and gene-environment interactions in GWAS with tree ensemble methods and SHAP values.

Affiliations

SINTEF DIGITAL, Forskningsveien 1, 0373, Oslo, Norway.

Department of Mathematical Sciences, Norwegian University of Science and Technology, A. Getz vei 1, 7491, Trondheim, Norway.

Publication information

BMC Bioinformatics. 2021 May 4;22(1):230. doi: 10.1186/s12859-021-04041-7.

DOI:10.1186/s12859-021-04041-7

PMID:33947323

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8097909/

Abstract

BACKGROUND

The identification of gene-gene and gene-environment interactions in genome-wide association studies is challenging due to the unknown nature of the interactions and the overwhelmingly large number of possible combinations. Parametric regression models are suitable to look for prespecified interactions. Nonparametric models such as tree ensemble models, with the ability to detect any unspecified interaction, have previously been difficult to interpret. However, with the development of methods for model explainability, it is now possible to interpret tree ensemble models efficiently and with a strong theoretical basis.
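To give a feel for the "overwhelmingly large number of possible combinations", the short sketch below counts the distinct unordered SNP combinations of a given order. The panel size of 500,000 variants is a hypothetical figure chosen for illustration and is not taken from the paper; the function name is likewise an assumption.

```python
from math import comb


def n_interaction_candidates(n_variants: int, order: int = 2) -> int:
    """Number of distinct unordered combinations of `order` variants."""
    return comb(n_variants, order)


# Hypothetical panel size, for illustration only (not a figure from the paper).
n_snps = 500_000

pairs = n_interaction_candidates(n_snps)             # pairwise interactions
triples = n_interaction_candidates(n_snps, order=3)  # third-order interactions

print(f"{pairs:,} candidate pairs")
print(f"{triples:,} candidate triples")
```

Even at second order the count grows quadratically with the number of variants, which is why exhaustively prespecifying interaction terms in a parametric model quickly becomes impractical.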

RESULTS

We propose a tree ensemble- and SHAP-based method for identifying as well as interpreting potential gene-gene and gene-environment interactions on large-scale biobank data. A set of independent cross-validation runs are used to implicitly investigate the whole genome. We apply and evaluate the method using data from the UK Biobank with obesity as the phenotype. The results are in line with previous research on obesity as we identify top SNPs previously associated with obesity. We further demonstrate how to interpret and visualize interaction candidates.
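The idea behind SHAP interaction values can be illustrated on a toy example. The sketch below is not the authors' pipeline: it replaces the tree ensemble with a hand-written function (`toy_model`, hypothetical), uses a single all-zero reference individual as the baseline (an assumption), and computes exact Shapley values and SHAP interaction values by brute-force enumeration of feature subsets rather than with a tree-specific algorithm. It is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial


def toy_model(x):
    """Hypothetical risk score: two additive SNP effects plus a SNP-by-environment term."""
    snp_a, snp_b, env = x
    return 0.5 * snp_a + 0.2 * snp_b + 1.0 * snp_a * env


def value(model, x, baseline, subset):
    """Model output when only the features in `subset` take the individual's values."""
    mixed = [x[k] if k in subset else baseline[k] for k in range(len(x))]
    return model(mixed)


def subsets(items):
    """Yield every subset of `items` as a set."""
    for r in range(len(items) + 1):
        for s in combinations(items, r):
            yield set(s)


def shapley_value(model, x, baseline, i):
    """Exact Shapley value of feature i (weighted average marginal contribution)."""
    m = len(x)
    others = [k for k in range(m) if k != i]
    total = 0.0
    for s in subsets(others):
        w = factorial(len(s)) * factorial(m - len(s) - 1) / factorial(m)
        total += w * (value(model, x, baseline, s | {i}) - value(model, x, baseline, s))
    return total


def shap_interaction(model, x, baseline, i, j):
    """Exact SHAP interaction value; the diagonal (i == j) holds the main effect."""
    m = len(x)
    if i == j:
        off_diag = sum(shap_interaction(model, x, baseline, i, k)
                       for k in range(m) if k != i)
        return shapley_value(model, x, baseline, i) - off_diag
    others = [k for k in range(m) if k not in (i, j)]
    total = 0.0
    for s in subsets(others):
        w = factorial(len(s)) * factorial(m - len(s) - 2) / (2 * factorial(m - 1))
        delta = (value(model, x, baseline, s | {i, j})
                 - value(model, x, baseline, s | {i})
                 - value(model, x, baseline, s | {j})
                 + value(model, x, baseline, s))
        total += w * delta
    return total


names = ["snp_a", "snp_b", "env"]
individual = [2, 1, 1]   # hypothetical genotype counts and exposure
reference = [0, 0, 0]    # assumed baseline individual

for i, name_i in enumerate(names):
    row = [shap_interaction(toy_model, individual, reference, i, j)
           for j in range(len(names))]
    print(name_i, [round(v, 3) for v in row])
```

In this toy setting the off-diagonal entry for `snp_a` and `env` is nonzero because the model contains their product, while the entries involving `snp_b` are zero up to floating-point error, since that feature enters only additively. Each row sums to the corresponding feature's Shapley value.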

CONCLUSIONS

The new method identifies interaction candidates otherwise not detected with parametric regression models. However, further research is needed to evaluate the uncertainties of these candidates. The method can be applied to large-scale biobanks with high-dimensional data.

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/90b7/8097909/4250aada04ed/12859_2021_4041_Fig1_HTML.jpg

Similar articles

Next-generation analysis of cataracts: determining knowledge driven gene-gene interactions using biofilter, and gene-environment interactions using the Phenx Toolkit*.

Pac Symp Biocomput. 2015:495-505.

Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests.

BMC Genomics. 2015;16 Suppl 2(Suppl 2):S5. doi: 10.1186/1471-2164-16-S2-S5. Epub 2015 Jan 21.

Next-generation analysis of cataracts: determining knowledge driven gene-gene interactions using Biofilter, and gene-environment interactions using the PhenX Toolkit.

Pac Symp Biocomput. 2013:147-58.

Using Genetic Marginal Effects to Study Gene-Environment Interactions with GWAS Data.

Behav Genet. 2021 May;51(3):358-373. doi: 10.1007/s10519-021-10058-8. Epub 2021 Apr 26.

A novel method to identify high order gene-gene interactions in genome-wide association studies: gene-based MDR.

BMC Bioinformatics. 2012 Jun 11;13 Suppl 9(Suppl 9):S5. doi: 10.1186/1471-2105-13-S9-S5.

A method combining a random forest-based technique with the modeling of linkage disequilibrium through latent variables, to run multilocus genome-wide association studies.

BMC Bioinformatics. 2018 Mar 27;19(1):106. doi: 10.1186/s12859-018-2054-0.

Gene, pathway and network frameworks to identify epistatic interactions of single nucleotide polymorphisms derived from GWAS data.

BMC Syst Biol. 2012;6 Suppl 3(Suppl 3):S15. doi: 10.1186/1752-0509-6-S3-S15. Epub 2012 Dec 17.

A Fast and Accurate Method for Genome-wide Scale Phenome-wide G × E Analysis and Its Application to UK Biobank.

Am J Hum Genet. 2019 Dec 5;105(6):1182-1192. doi: 10.1016/j.ajhg.2019.10.008. Epub 2019 Nov 14.

Application of the parametric bootstrap for gene-set analysis of gene-environment interactions.

Eur J Hum Genet. 2018 Nov;26(11):1679-1686. doi: 10.1038/s41431-018-0236-x. Epub 2018 Aug 8.

Cited by

Advancing genome-based precision medicine: a review on machine learning applications for rare genetic disorders.

Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf329.

Gene-environment interactions in human health.

Nat Rev Genet. 2024 Nov;25(11):768-784. doi: 10.1038/s41576-024-00731-z. Epub 2024 May 28.

Genetic Inheritance Models of Non-Syndromic Cleft Lip with or without Palate: From Monogenic to Polygenic.

Genes (Basel). 2023 Sep 24;14(10):1859. doi: 10.3390/genes14101859.

E-GWAS: an ensemble-like GWAS strategy that provides effective control over false positive rates without decreasing true positives.

Genet Sel Evol. 2023 Jul 5;55(1):46. doi: 10.1186/s12711-023-00820-3.

Machine learning classifier approaches for predicting response to RTK-type-III inhibitors demonstrate high accuracy using transcriptomic signatures and data.

Bioinform Adv. 2023 Mar 22;3(1):vbad034. doi: 10.1093/bioadv/vbad034. eCollection 2023.

Inferring feature importance with uncertainties with application to large genotype data.

PLoS Comput Biol. 2023 Mar 14;19(3):e1010963. doi: 10.1371/journal.pcbi.1010963. eCollection 2023 Mar.

Gene-gene interaction detection with deep learning.

Commun Biol. 2022 Nov 12;5(1):1238. doi: 10.1038/s42003-022-04186-y.

Human genotype-to-phenotype predictions: Boosting accuracy with nonlinear models.

PLoS One. 2022 Aug 31;17(8):e0273293. doi: 10.1371/journal.pone.0273293. eCollection 2022.

Compressive Strength Estimation of Steel-Fiber-Reinforced Concrete and Raw Material Interactions Using Advanced Algorithms.

Polymers (Basel). 2022 Jul 29;14(15):3065. doi: 10.3390/polym14153065.

Use of Artificial Intelligence for Predicting Parameters of Sustainable Concrete and Raw Ingredient Effects and Interactions.

Materials (Basel). 2022 Jul 27;15(15):5207. doi: 10.3390/ma15155207.

