r2VIM: A new variable selection method for random forests in genome-wide association studies.

Author information

Szymczak Silke, Holzinger Emily, Dasgupta Abhijit, Malley James D, Molloy Anne M, Mills James L, Brody Lawrence C, Stambolian Dwight, Bailey-Wilson Joan E

Affiliations

Statistical Genetics Section, Inherited Disease Research Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Dr, 21224 Baltimore, USA; Current address: Institute of Medical Informatics and Statistics, University of Kiel, Brunswiker Str. 10, 24105 Kiel, Germany.

Statistical Genetics Section, Inherited Disease Research Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Dr, 21224 Baltimore, USA.

Publication information

BioData Min. 2016 Feb 1;9:7. doi: 10.1186/s13040-016-0087-3. eCollection 2016.

DOI:10.1186/s13040-016-0087-3

PMID:26839594

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4736152/

Abstract

BACKGROUND

Machine learning methods and in particular random forests (RFs) are a promising alternative to standard single SNP analyses in genome-wide association studies (GWAS). RFs provide variable importance measures (VIMs) to rank SNPs according to their predictive power. However, in contrast to the established genome-wide significance threshold, no clear criteria exist to determine how many SNPs should be selected for downstream analyses.

RESULTS

We propose a new variable selection approach, recurrent relative variable importance measure (r2VIM). Importance values are calculated relative to an observed minimal importance score for several runs of RF and only SNPs with large relative VIMs in all of the runs are selected as important. Evaluations on simulated GWAS data show that the new method controls the number of false-positives under the null hypothesis. Under a simple alternative hypothesis with several independent main effects it is only slightly less powerful than logistic regression. In an experimental GWAS data set, the same strong signal is identified while the approach selects none of the SNPs in an underpowered GWAS.
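The selection rule described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each RF run has already produced one importance score per SNP, that "relative" means dividing by the absolute value of the smallest score observed in that run, and that "large" means at or above a user-chosen factor. The function names, the default factor of 3.0, and the toy scores are all hypothetical.

```python
from typing import List, Sequence


def relative_importances(importances: Sequence[float]) -> List[float]:
    """Scale one run's scores by the absolute value of that run's minimal score.

    Assumption: the abstract only says scores are taken "relative to an observed
    minimal importance score"; dividing by its absolute value is one reading.
    """
    min_abs = abs(min(importances))
    if min_abs == 0:
        raise ValueError("minimal importance is zero; relative scores are undefined")
    return [value / min_abs for value in importances]


def select_recurrent(runs: Sequence[Sequence[float]], factor: float = 3.0) -> List[int]:
    """Return indices of variables whose relative score is >= factor in every run.

    `factor` is a hypothetical tuning parameter, not a value taken from the abstract.
    """
    if not runs:
        return []
    n_vars = len(runs[0])
    if any(len(run) != n_vars for run in runs):
        raise ValueError("all runs must score the same number of variables")
    relative = [relative_importances(run) for run in runs]
    return [j for j in range(n_vars) if all(rel[j] >= factor for rel in relative)]


# Made-up importance scores for 5 SNPs across 3 RF runs (illustration only).
runs = [
    [0.90, -0.10, 0.05, 0.40, -0.02],
    [0.80, -0.08, 0.30, 0.10, 0.01],
    [1.00, -0.12, -0.03, 0.50, 0.02],
]
selected = select_recurrent(runs, factor=3.0)
print(selected)
```

In this toy input only the first SNP clears the factor in all three runs. The others exceed it in some runs but not all, which is the recurrence requirement that the method relies on to limit false positives.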

CONCLUSIONS

The novel variable selection method r2VIM is a promising extension to standard RF for objectively selecting relevant SNPs in GWAS while controlling the number of false-positive results.

