Comparison of parametric and machine methods for variable selection in simulated Genetic Analysis Workshop 19 data.

Author information

Holzinger Emily R, Szymczak Silke, Malley James, Pugh Elizabeth W, Ling Hua, Griffith Sean, Zhang Peng, Li Qing, Cropp Cheryl D, Bailey-Wilson Joan E

Affiliations

Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Drive, Suite 1200, Baltimore, MD 21224 USA.

Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Drive, Suite 1200, Baltimore, MD 21224 USA; Institute of Clinical Molecular Biology, Christian-Albrechts-University of Kiel, Christian-Albrechts-Platz 4, 24118 Kiel, Germany.

Publication information

BMC Proc. 2016 Oct 18;10(Suppl 7):147-152. doi: 10.1186/s12919-016-0021-1. eCollection 2016.

DOI:10.1186/s12919-016-0021-1

PMID:27980627

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5133476/

Abstract

Current findings from genetic studies of complex human traits often fail to explain a large proportion of the variation in these traits that is estimated to be due to genetic factors. This could be due, in part, to overly stringent significance thresholds in traditional statistical methods, such as linear and logistic regression. Machine learning methods, such as Random Forests (RF), offer an alternative approach for identifying potentially interesting variants. One major issue with these methods is that there is no clear way to distinguish between probable true hits and noise variables based on the calculated importance metric. To this end, we are developing the Relative Recurrency Variable Importance Metric (r2VIM), an RF-based variable selection method. Here, we apply r2VIM to the Genetic Analysis Workshop 19 data from unrelated individuals, with simulated systolic blood pressure as the phenotype. We compare the number of "true" functional variants identified by r2VIM with the number identified by linear regression analyses that use a Bonferroni correction to calculate the significance threshold. Our results show that r2VIM performed comparably to linear regression. Our findings are a proof of concept for r2VIM: when the optimal importance score threshold is used, it identifies similar numbers of functional and nonfunctional variants to a more commonly used technique.
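The two selection rules compared in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The Bonferroni rule is the standard one (reject when p < alpha / number of tests). The importance rule is an assumption about how an r2VIM-style cutoff might work, since the abstract only mentions an importance score threshold: each raw importance score is divided by the absolute value of the smallest score in that run, and a variable is kept only if this relative score reaches a chosen factor in every run. The function names, the `factor` parameter, and all numbers are hypothetical.

```python
def bonferroni_select(p_values, alpha=0.05):
    """Return variants whose p-value is below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return sorted(name for name, p in p_values.items() if p < threshold)


def relative_importance_select(runs, factor=3.0):
    """Return variables whose relative importance is >= factor in every run.

    Assumption (not stated in the abstract): relative importance is the raw
    score divided by the absolute value of the minimum score in the same run.
    `runs` is a list of dicts mapping variable name to raw importance score.
    """
    selected = None
    for scores in runs:
        scale = abs(min(scores.values()))
        if scale == 0:
            raise ValueError("minimum importance is zero; cannot rescale")
        passing = {name for name, s in scores.items() if s / scale >= factor}
        # Keep only variables that passed in all runs seen so far.
        selected = passing if selected is None else selected & passing
    return sorted(selected or [])


# Made-up p-values from per-variant linear regressions (illustrative only).
p_values = {"snp1": 1e-6, "snp2": 0.004, "snp3": 0.2, "snp4": 0.03}

# Made-up importance scores from two forest runs (illustrative only).
runs = [
    {"snp1": 9.0, "snp2": 2.0, "snp3": -1.0, "snp4": 0.5},
    {"snp1": 8.0, "snp2": 7.0, "snp3": -2.0, "snp4": -0.5},
]

print(bonferroni_select(p_values))
print(relative_importance_select(runs, factor=3.0))
```

In this toy input the Bonferroni threshold is 0.05 / 4 = 0.0125, so `snp1` and `snp2` pass, while only `snp1` reaches a relative importance of 3 in both runs. Raising or lowering `factor` changes how many variables are kept, which is why the choice of importance threshold matters for the comparison.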
