Identifying SNPs predictive of phenotype using random forests.

Author information

Bureau Alexandre, Dupuis Josée, Falls Kathleen, Lunetta Kathryn L, Hayward Brooke, Keith Tim P, Van Eerdewegh Paul

Affiliation

Department of Human Genetics, Oscient Pharmaceuticals, Waltham, Massachusetts, USA.

Publication information

Genet Epidemiol. 2005 Feb;28(2):171-82. doi: 10.1002/gepi.20041.

PMID:15593090

Abstract

There has been a great interest and a few successes in the identification of complex disease susceptibility genes in recent years. Association studies, where a large number of single-nucleotide polymorphisms (SNPs) are typed in a sample of cases and controls to determine which genes are associated with a specific disease, provide a powerful approach for complex disease gene mapping. Genes of interest in those studies may contain large numbers of SNPs that classical statistical methods cannot handle simultaneously without requiring prohibitively large sample sizes. By contrast, high-dimensional nonparametric methods thrive on large numbers of predictors. This work explores the application of one such method, random forests, to the problem of identifying SNPs predictive of the phenotype in the case-control study design. A random forest is a collection of classification trees grown on bootstrap samples of observations, using a random subset of predictors to define the best split at each node. The observations left out of the bootstrap samples are used to estimate prediction error. The importance of a predictor is quantified by the increase in misclassification occurring when the values of the predictor are randomly permuted. We extend the concept of importance to pairs of predictors, to capture joint effects, and we explore the behavior of importance measures over a range of two-locus disease models in the presence of a varying number of SNPs unassociated with the phenotype. We illustrate the application of random forests with a data set of asthma cases and unaffected controls genotyped at 42 SNPs in ADAM33, a previously identified asthma susceptibility gene. SNPs and SNP pairs highly associated with asthma tend to have the highest importance index value, but predictive importance and association do not always coincide.
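
The procedure described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the function names, the 0/1/2 genotype coding, the Gini split criterion, the simulated two-locus model, and all parameter values are assumptions chosen for illustration. Pair importance is computed here by applying the same permutation to both SNPs in the out-of-bag observations, which is one plausible reading of "importance extended to pairs of predictors"; the paper's exact definition may differ.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y, rows, candidates):
    """Best (snp, threshold) among candidate SNPs by weighted Gini, or None."""
    best, best_score = None, gini([y[i] for i in rows])
    for snp in candidates:
        for t in (0, 1):  # genotypes coded 0/1/2 (assumed); split on genotype <= t
            left = [y[i] for i in rows if X[i][snp] <= t]
            right = [y[i] for i in rows if X[i][snp] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score - 1e-12:
                best, best_score = (snp, t), score
    return best


def grow_tree(X, y, rows, mtry, rng, min_size=5):
    """Classification tree; each node considers a random subset of mtry SNPs."""
    labels = [y[i] for i in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(rows) < min_size or len(set(labels)) == 1:
        return majority
    split = best_split(X, y, rows, rng.sample(range(len(X[0])), mtry))
    if split is None:
        return majority
    snp, t = split
    left = [i for i in rows if X[i][snp] <= t]
    right = [i for i in rows if X[i][snp] > t]
    return (snp, t, grow_tree(X, y, left, mtry, rng, min_size),
            grow_tree(X, y, right, mtry, rng, min_size))


def predict(tree, x):
    """Follow splits until a leaf (an integer class label) is reached."""
    while isinstance(tree, tuple):
        snp, t, left, right = tree
        tree = left if x[snp] <= t else right
    return tree


def grow_forest(X, y, n_trees, mtry, rng):
    """Grow each tree on a bootstrap sample and remember its out-of-bag rows."""
    n, forest = len(X), []
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]
        oob = sorted(set(range(n)) - set(boot))
        forest.append((grow_tree(X, y, boot, mtry, rng), oob))
    return forest


def importance(forest, X, y, snps, rng):
    """Mean increase in out-of-bag misclassification rate when the listed
    SNPs are permuted together (one SNP: single importance; two: pair)."""
    total = 0.0
    for tree, oob in forest:
        if not oob:
            continue
        base = sum(predict(tree, X[i]) != y[i] for i in oob)
        donors = oob[:]
        rng.shuffle(donors)
        errors = 0
        for i, j in zip(oob, donors):
            x = list(X[i])
            for s in snps:
                x[s] = X[j][s]  # same donor row for every SNP in the set
            errors += predict(tree, x) != y[i]
        total += (errors - base) / len(oob)
    return total / len(forest)


def simulate(n, n_snps, rng, maf=0.4, noise=0.1):
    """Hypothetical two-locus model: affected when SNPs 0 and 1 both carry
    at least one minor allele, with 10% of labels flipped at random."""
    X = [[(rng.random() < maf) + (rng.random() < maf) for _ in range(n_snps)]
         for _ in range(n)]
    y = []
    for g in X:
        status = int(g[0] >= 1 and g[1] >= 1)
        y.append(1 - status if rng.random() < noise else status)
    return X, y


rng = random.Random(2005)
X, y = simulate(300, 6, rng)
forest = grow_forest(X, y, n_trees=50, mtry=2, rng=rng)
single = [importance(forest, X, y, [s], rng) for s in range(6)]
pair = importance(forest, X, y, [0, 1], rng)
print("single-SNP importance:", [round(v, 3) for v in single])
print("pair (0, 1) importance:", round(pair, 3))
```

In this simulated setting, the two causal SNPs should stand out from the four unassociated ones, and the pair importance should exceed either single-SNP importance because permuting both loci removes the joint signal. Real case-control data with linkage disequilibrium between SNPs would behave less cleanly, which is consistent with the abstract's remark that predictive importance and association do not always coincide.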
