Szymczak Silke, Holzinger Emily, Dasgupta Abhijit, Malley James D, Molloy Anne M, Mills James L, Brody Lawrence C, Stambolian Dwight, Bailey-Wilson Joan E
Statistical Genetics Section, Inherited Disease Research Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Dr, 21224 Baltimore, USA ; Current address: Institute of Medical Informatics and Statistics, University of Kiel, Brunswiker Str. 10, 24105 Kiel, Germany.
Statistical Genetics Section, Inherited Disease Research Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Dr, 21224 Baltimore, USA.
BioData Min. 2016 Feb 1;9:7. doi: 10.1186/s13040-016-0087-3. eCollection 2016.
Machine learning methods, and in particular random forests (RFs), are a promising alternative to standard single-SNP analyses in genome-wide association studies (GWAS). RFs provide variable importance measures (VIMs) that rank SNPs according to their predictive power. However, in contrast to the established genome-wide significance threshold, no clear criteria exist for deciding how many SNPs should be selected for downstream analyses.
We propose a new variable selection approach, the recurrent relative variable importance measure (r2VIM). Importance values are calculated relative to an observed minimal importance score for several runs of RF, and only SNPs with large relative VIMs in all of the runs are selected as important. Evaluations on simulated GWAS data show that the new method controls the number of false positives under the null hypothesis. Under a simple alternative hypothesis with several independent main effects, it is only slightly less powerful than logistic regression. In an experimental GWAS data set, the approach identifies the same strong signal, whereas in an underpowered GWAS it selects none of the SNPs.
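The selection rule described above can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: importance scores from each RF run are already available as plain lists, "relative" is read as dividing each score by the absolute value of the smallest (most negative) score in that run, and a SNP counts as "large" when its relative VIM reaches a hypothetical threshold called `factor`. The function names and the toy numbers are invented for illustration only.

```python
def relative_vims(vims):
    """Scale one run's importance scores by the absolute value of its minimum.

    Assumption: the minimal score is negative, as can happen with
    permutation importance, so its magnitude serves as a noise scale.
    """
    min_vim = min(vims)
    if min_vim >= 0:
        raise ValueError("expected at least one negative importance score")
    scale = abs(min_vim)
    return [v / scale for v in vims]


def select_recurrent(vims_per_run, factor=3.0):
    """Return indices of variables whose relative VIM is >= factor in every run."""
    relative = [relative_vims(run) for run in vims_per_run]
    n_vars = len(relative[0])
    return [j for j in range(n_vars) if all(run[j] >= factor for run in relative)]


# Toy importance scores: 3 RF runs x 5 SNPs (made-up numbers, not from the paper).
runs = [
    [0.90, 0.02, -0.10, 0.35, 0.01],
    [0.80, -0.08, 0.03, 0.20, 0.00],
    [1.00, 0.01, -0.05, 0.40, -0.10],
]
selected = select_recurrent(runs, factor=3.0)
print(selected)
```

In this toy example, the fourth SNP (index 3) exceeds the threshold in two runs but not in the second one, so requiring a large relative VIM in every run excludes it; lowering `factor` to 2.0 would include it.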
The novel variable selection method r2VIM is a promising extension to standard RF for objectively selecting relevant SNPs in GWAS while controlling the number of false-positive results.