Stability of variable importance scores and rankings using statistical learning tools on single-nucleotide polymorphisms and risk factors involved in gene x gene and gene x environment interactions.

Author information

Nicodemus Kristin K, Wang Wenyi, Shugart Yin Yao

Affiliations

Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford, OX3 7BN, UK.

Publication information

BMC Proc. 2007;1 Suppl 1(Suppl 1):S58. doi: 10.1186/1753-6561-1-s1-s58. Epub 2007 Dec 18.

DOI:10.1186/1753-6561-1-s1-s58

PMID:18466558

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2367584/

Abstract

Risk of complex disorders is thought to be multifactorial, involving interactions between risk factors. However, many genetic studies assess association between disease status and markers one single-nucleotide polymorphism (SNP) at a time, due to the high-dimensional nature of the search space of all possible interactions. Three ensemble methods have recently been proposed for use in high-dimensional data (Monte Carlo logic regression, random forests, and generalized boosted regression). An intuitive way to detect an association between genetic markers and disease status is to use variable importance measures, even though the stability of these measures in the context of a whole-genome association study is unknown. For the simulated data of Problem 3 in the Genetic Analysis Workshop 15 (GAW15), we examined the variability of both rankings and magnitude of variable importance measures using 10 variables simulated to participate in gene x gene and gene x environment interactions. We conducted 500 analyses per method on one randomly selected replicate, tallying the rankings and importance measures for each of the 10 variables of interest. When the simulated effect size was strong, all three methods showed stable rankings and estimates of variable importance. However, under conditions more commonly expected to be encountered in complex diseases, random forests and generalized boosted regression showed more stable estimates of variable importance and variable rankings. Investigators applying statistical learning methods to detect interactions in complex disease studies should perform repeated analyses to ensure that variable importance measures and rankings do not vary greatly, even for statistical learning algorithms that are thought to be stable.
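
The repeated-analysis check described in the abstract (many analyses on one replicate, tallying rankings and importance scores per variable) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `run_analysis` is a hypothetical stand-in that returns a fixed effect plus Gaussian noise, where a real study would refit a random forest, boosted model, or Monte Carlo logic regression with a new random seed. The variable names, effect sizes, and noise level are invented for the example.

```python
import random
import statistics
from collections import Counter


def rank_scores(scores):
    """Map variable -> rank (1 = most important) for a single analysis."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def summarize_stability(runs):
    """Tally rankings and importance scores across repeated analyses.

    `runs` is a list of dicts, each mapping variable name -> importance score.
    """
    ranked_runs = [rank_scores(run) for run in runs]
    summary = {}
    for name in runs[0]:
        scores = [run[name] for run in runs]
        ranks = [ranked[name] for ranked in ranked_runs]
        summary[name] = {
            "mean_score": statistics.mean(scores),
            "sd_score": statistics.stdev(scores),
            "min_rank": min(ranks),
            "max_rank": max(ranks),
            "rank_counts": dict(Counter(ranks)),  # how often each rank occurred
        }
    return summary


def run_analysis(rng, true_effects, noise_sd):
    """Hypothetical stand-in for one model fit: true effect plus noise."""
    return {name: effect + rng.gauss(0.0, noise_sd)
            for name, effect in true_effects.items()}


# Invented effect sizes: one strong signal and two weak, similar signals.
TRUE_EFFECTS = {"snp_strong": 5.0, "snp_weak_a": 1.0, "snp_weak_b": 0.9}

rng = random.Random(15)  # fixed seed so the sketch is reproducible
runs = [run_analysis(rng, TRUE_EFFECTS, noise_sd=0.5) for _ in range(500)]
stability = summarize_stability(runs)

for name, stats in stability.items():
    print(name, stats)
```

In a summary like this, a variable whose `min_rank` and `max_rank` are far apart, or whose `sd_score` is large relative to `mean_score`, would call for caution before its importance is reported as a finding.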

Similar articles

SNP interaction detection with Random Forests in high-dimensional genetic data.

BMC Bioinformatics. 2012 Jul 15;13:164. doi: 10.1186/1471-2105-13-164.

Identification of SNP interactions using logic regression.

Biostatistics. 2008 Jan;9(1):187-98. doi: 10.1093/biostatistics/kxm024. Epub 2007 Jun 19.

Kernel-Based Measure of Variable Importance for Genetic Association Studies.

Int J Biostat. 2017 Jun 17;13(2):/j/ijb.2017.13.issue-2/ijb-2016-0087/ijb-2016-0087.xml. doi: 10.1515/ijb-2016-0087.

Evaluation of tree-based statistical learning methods for constructing genetic risk scores.

BMC Bioinformatics. 2022 Mar 21;23(1):97. doi: 10.1186/s12859-022-04634-w.

Do little interactions get lost in dark random forests?

BMC Bioinformatics. 2016 Mar 31;17:145. doi: 10.1186/s12859-016-0995-8.

Modeling X Chromosome Data Using Random Forests: Conquering Sex Bias.

Genet Epidemiol. 2016 Feb;40(2):123-32. doi: 10.1002/gepi.21946. Epub 2015 Dec 7.

Random Forests approach for identifying additive and epistatic single nucleotide polymorphisms associated with residual feed intake in dairy cattle.

J Dairy Sci. 2013 Oct;96(10):6716-29. doi: 10.3168/jds.2012-6237. Epub 2013 Aug 9.

Identification of interactions of binary variables associated with survival time using survivalFS.

Arch Toxicol. 2019 Mar;93(3):585-602. doi: 10.1007/s00204-019-02398-6. Epub 2019 Jan 29.

Predictor correlation impacts machine learning algorithms: implications for genomic studies.

Bioinformatics. 2009 Aug 1;25(15):1884-90. doi: 10.1093/bioinformatics/btp331. Epub 2009 May 21.

Cited by

Identifying interactions among factors related to death occurred at the scene of traffic accidents: Application of "logic regression" method.

Heliyon. 2024 Jun 5;10(11):e32469. doi: 10.1016/j.heliyon.2024.e32469. eCollection 2024 Jun 15.

Cost-Effectiveness of Peer- Versus Venue-Based Approaches for Detecting Undiagnosed HIV Among Heterosexuals in High-Risk New York City Neighborhoods.

J Acquir Immune Defic Syndr. 2018 Feb 1;77(2):183-192. doi: 10.1097/QAI.0000000000001578.

Immunologic profiles distinguish aviremic HIV-infected adults.

AIDS. 2016 Jun 19;30(10):1553-62. doi: 10.1097/QAD.0000000000001049.

Random forests for genetic association studies.

Stat Appl Genet Mol Biol. 2011;10(1):32. doi: 10.2202/1544-6115.1691. Epub 2011 Jul 12.

Biological validation of increased schizophrenia risk with NRG1, ERBB4, and AKT1 epistasis via functional neuroimaging in healthy controls.

Arch Gen Psychiatry. 2010 Oct;67(10):991-1001. doi: 10.1001/archgenpsychiatry.2010.117.

Evidence of statistical epistasis between DISC1, CIT and NDEL1 impacting risk for schizophrenia: biological validation with functional neuroimaging.

Hum Genet. 2010 Apr;127(4):441-52. doi: 10.1007/s00439-009-0782-y.
