SNP interaction detection with Random Forests in high-dimensional genetic data.

Affiliation

Department of Health Sciences Research, Mayo Clinic, 200 First Street Southwest, Rochester, MN 55905, USA.

Publication information

BMC Bioinformatics. 2012 Jul 15;13:164. doi: 10.1186/1471-2105-13-164.

DOI:10.1186/1471-2105-13-164

PMID:22793366

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3463421/

Abstract

BACKGROUND

Identifying variants associated with complex human traits in high-dimensional data is a central goal of genome-wide association studies. However, complicated etiologies such as gene-gene interactions are ignored by the univariate analysis usually applied in these studies. Random Forests (RF) are a popular data-mining technique that can accommodate a large number of predictor variables and allow for complex models with interactions. RF analysis produces measures of variable importance that can be used to rank the predictor variables. Thus, single nucleotide polymorphism (SNP) analysis using RFs is gaining popularity as a potential filter approach that considers interactions in high-dimensional data. However, the impact of data dimensionality on the power of RF to identify interactions has not been thoroughly explored. We investigate the ability of rankings from variable importance measures to detect gene-gene interaction effects and their potential effectiveness as filters compared to p-values from univariate logistic regression, particularly as the data becomes increasingly high-dimensional.
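
The claim that univariate analysis can miss gene-gene interactions can be made concrete with a small worked calculation. The sketch below is illustrative only and is not taken from the study: the two-SNP model, the carrier frequencies, the penetrance values, and the names `CARRIER_FREQ`, `PENETRANCE`, `marginal_effect`, and `interaction_contrast` are all assumptions chosen so that disease risk depends on the pair of SNPs but not on either SNP alone.

```python
# Hypothetical two-SNP model; every number here is an illustrative assumption.
CARRIER_FREQ = {"snp_a": 0.5, "snp_b": 0.5}  # P(carrier = 1) for each SNP

# P(case | a, b): risk is high only when exactly one SNP is carried (XOR-like).
PENETRANCE = {(0, 0): 0.2, (0, 1): 0.8, (1, 0): 0.8, (1, 1): 0.2}


def marginal_risk(snp, value):
    """P(case | snp = value), averaging over the other SNP (assumed independent)."""
    other = "snp_b" if snp == "snp_a" else "snp_a"
    p_other = CARRIER_FREQ[other]
    risk = 0.0
    for other_value, weight in ((0, 1.0 - p_other), (1, p_other)):
        # PENETRANCE keys are ordered (snp_a value, snp_b value).
        key = (value, other_value) if snp == "snp_a" else (other_value, value)
        risk += weight * PENETRANCE[key]
    return risk


def marginal_effect(snp):
    """Risk difference between carriers and non-carriers of a single SNP."""
    return marginal_risk(snp, 1) - marginal_risk(snp, 0)


def interaction_contrast():
    """Difference of differences on the risk scale; zero means no interaction."""
    p = PENETRANCE
    return (p[(1, 1)] - p[(1, 0)]) - (p[(0, 1)] - p[(0, 0)])


if __name__ == "__main__":
    for snp in CARRIER_FREQ:
        print(snp, "marginal effect:", round(marginal_effect(snp), 6))
    print("interaction contrast:", round(interaction_contrast(), 6))
```

Under these assumed values, each SNP's marginal risk difference is zero while the joint contrast is large, so a test that examines one SNP at a time has nothing to detect. Real interactions usually carry some marginal signal, so this is a limiting case rather than a typical one.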

RESULTS

RF effectively identifies interactions in low dimensional data. As the total number of predictor variables increases, probability of detection declines more rapidly for interacting SNPs than for non-interacting SNPs, indicating that in high-dimensional data the RF variable importance measures are capturing marginal effects rather than capturing the effects of interactions.
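
One simple piece of arithmetic may help build intuition for why dimensionality matters, although the abstract does not give this explanation itself. At each tree node, RF considers only a random subset of `mtry` predictors, so the chance that any one specific SNP is even a candidate shrinks as the total number of predictors grows. The sketch below assumes the common classification default of `mtry` equal to the floor of the square root of the predictor count; the study's actual settings are not stated here, and the function name `candidate_probability` is hypothetical.

```python
import math


def candidate_probability(n_predictors, mtry=None):
    """Chance that one specific predictor is among the mtry candidates at a node.

    Candidates are drawn without replacement, so the chance is mtry / n_predictors.
    """
    if mtry is None:
        # Assumed default: floor(sqrt(p)), a common choice for classification.
        mtry = max(1, math.isqrt(n_predictors))
    return mtry / n_predictors


if __name__ == "__main__":
    for p in (100, 1_000, 10_000, 100_000):
        print(p, "predictors ->", round(candidate_probability(p), 4))
```

A SNP with no marginal effect is selected for a split only by chance when it is a candidate, and its partner must then be offered at a descendant node for the interaction to register. This calculation covers only the first of those steps, so it is a rough intuition and not a model of detection power.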

CONCLUSIONS

While RF remains a promising data-mining technique that extends univariate methods to condition on multiple variables simultaneously, RF variable importance measures fail to detect interaction effects in high-dimensional data in the absence of a strong marginal component, and therefore may not be useful as a filter technique that allows for interaction effects in genome-wide data.
