Winham Stacey J, Jenkins Gregory D, Biernacka Joanna M
Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America.
Department of Psychiatry and Psychology, Mayo Clinic, Rochester, Minnesota, United States of America.
Genet Epidemiol. 2016 Feb;40(2):123-32. doi: 10.1002/gepi.21946. Epub 2015 Dec 7.
Machine learning methods, including Random Forests (RF), are increasingly used for genetic data analysis. However, the standard RF algorithm does not correctly model the effects of X chromosome single nucleotide polymorphisms (SNPs), leading to biased estimates of variable importance. We propose extensions of RF to correctly model X SNPs, including a stratified approach and an approach based on the process of X chromosome inactivation. We applied the new and standard RF approaches to case-control alcohol dependence data from the Study of Addiction: Genes and Environment (SAGE), and compared the performance of the alternative approaches via a simulation study. Standard RF applied to a case-control study of alcohol dependence yielded inflated variable importance estimates for X SNPs, even when sex was included as a variable, but the results of the new RF methods were consistent with univariate regression-based approaches that correctly model X chromosome data. Simulations showed that the new RF methods eliminate the bias in standard RF variable importance for X SNPs when sex is associated with the trait, and are able to detect causal autosomal and X SNPs. Even in the absence of sex effects, the new extensions perform similarly to standard RF. Thus, we provide a powerful multimarker approach for genetic analysis that accommodates X chromosome data in an unbiased way. This method is implemented in the freely available R package "snpRF" (http://www.cran.r-project.org/web/packages/snpRF/).
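The abstract does not spell out how X SNP genotypes are coded, so the following is only a minimal sketch of the general issue and not the snpRF implementation. It assumes a common convention in which female genotypes are counted as 0/1/2 minor alleles and male hemizygous genotypes are recoded from 0/1 to 0/2 to mimic X chromosome inactivation. The function names, the minor allele frequency, and the sample size are hypothetical choices made for illustration. With raw allele counts, the mean genotype differs between the sexes, so an X SNP partly acts as a proxy for sex; this is one plausible route by which a sex-associated trait could inflate variable importance for X SNPs.

```python
import random

def simulate_genotype(sex, maf, rng):
    """Count minor alleles at an X SNP: females carry two X copies, males one."""
    n_copies = 2 if sex == "F" else 1
    return sum(rng.random() < maf for _ in range(n_copies))

def xci_code(genotype, sex):
    """Assumed XCI-style coding: recode male genotypes 0/1 as 0/2, leave females as 0/1/2."""
    return 2 * genotype if sex == "M" else genotype

def mean_by_sex(values, sexes, sex):
    """Mean of the values belonging to individuals of the given sex."""
    selected = [v for v, s in zip(values, sexes) if s == sex]
    return sum(selected) / len(selected)

rng = random.Random(0)   # fixed seed so the sketch is reproducible
maf = 0.3                # hypothetical minor allele frequency
sexes = [rng.choice("MF") for _ in range(20000)]
raw = [simulate_genotype(s, maf, rng) for s in sexes]
coded = [xci_code(g, s) for g, s in zip(raw, sexes)]

# Difference in mean genotype between females and males under each coding.
raw_gap = mean_by_sex(raw, sexes, "F") - mean_by_sex(raw, sexes, "M")
coded_gap = mean_by_sex(coded, sexes, "F") - mean_by_sex(coded, sexes, "M")
print(f"raw allele-count gap (F - M): {raw_gap:.3f}")
print(f"XCI-coded gap (F - M): {coded_gap:.3f}")
```

Under these assumptions the raw gap should sit near the minor allele frequency and the recoded gap near zero. The recoding equalizes the means only; the genotype variance still differs by sex, which may be part of why a stratified alternative is also proposed.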