Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, CT 06520-8034, USA.
Bioinformatics. 2010 Mar 15;26(6):831-7. doi: 10.1093/bioinformatics/btq038. Epub 2010 Feb 3.
High-dimensional data are frequently generated in genome-wide association studies (GWAS) and other studies. It is important to identify features, such as single nucleotide polymorphisms (SNPs) in GWAS, that are associated with a disease. Random forests are a useful approach for this purpose because they provide a variable importance score. However, this importance score has several shortcomings. We propose an alternative importance measure to overcome them.
We characterized the effect of multiple SNPs under various models using our proposed importance measure in random forests, which uses the maximal conditional chi-square (MCC) as a measure of association between a SNP and the trait, conditional on other SNPs. Based on this importance measure, we employed a permutation test to estimate empirical P-values of SNPs. Our method was compared with a univariate test and with permutation tests using the Gini and permutation importance measures. In simulations, the proposed method consistently outperformed the other methods in identifying risk SNPs. In a GWAS of age-related macular degeneration, the proposed method confirmed two significant SNPs (at the genome-wide adjusted level of 0.05). Further analysis showed that these two SNPs were consistent with a heterogeneity model. Compared with the existing importance measures, the MCC importance measure is more sensitive to complex effects of risk SNPs because it utilizes conditional information on different SNPs. The permutation test with the MCC importance measure provides an efficient way to identify candidate SNPs in GWAS and facilitates understanding of the etiological relationship between genetic variants and complex diseases.
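The permutation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the statistic below is a simplified stand-in that takes the largest Pearson chi-square between a SNP and a binary trait within strata defined by one other SNP, whereas the paper's MCC is computed within random forests. The function names, the genotype coding (0/1/2), the `min_stratum` threshold, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def chi_square(genotype, trait):
    """Pearson chi-square for a 3x2 table of genotype (0/1/2) by binary trait."""
    table = np.zeros((3, 2))
    np.add.at(table, (genotype, trait), 1)  # count samples per (genotype, trait) cell
    total = table.sum()
    if total == 0:
        return 0.0
    expected = table.sum(axis=1, keepdims=True) * table.sum(axis=0, keepdims=True) / total
    nonzero = expected > 0  # skip cells whose row or column is empty
    return float((((table - expected) ** 2)[nonzero] / expected[nonzero]).sum())


def max_conditional_chi_square(snps, trait, j, min_stratum=20):
    """Simplified stand-in for MCC: the largest chi-square for SNP j within
    strata defined by the genotype of each other SNP (hypothetical helper)."""
    best = 0.0
    for k in range(snps.shape[1]):
        if k == j:
            continue
        for level in (0, 1, 2):
            stratum = snps[:, k] == level
            if stratum.sum() >= min_stratum:  # ignore very small strata
                best = max(best, chi_square(snps[stratum, j], trait[stratum]))
    return best


def empirical_p_values(snps, trait, n_perm=199, seed=0):
    """Per-SNP empirical P-values obtained by permuting the trait labels."""
    rng = np.random.default_rng(seed)
    n_snps = snps.shape[1]
    observed = np.array(
        [max_conditional_chi_square(snps, trait, j) for j in range(n_snps)]
    )
    exceed = np.zeros(n_snps)
    for _ in range(n_perm):
        shuffled = rng.permutation(trait)  # breaks any SNP-trait association
        null = np.array(
            [max_conditional_chi_square(snps, shuffled, j) for j in range(n_snps)]
        )
        exceed += null >= observed
    # Add-one correction keeps the estimate strictly above zero.
    return (exceed + 1) / (n_perm + 1)


# Simulated example (assumed data): only the first SNP affects the trait.
rng = np.random.default_rng(1)
snps = rng.integers(0, 3, size=(300, 4))
trait = (rng.random(300) < 0.1 + 0.4 * snps[:, 0]).astype(int)
print(empirical_p_values(snps, trait))
```

These are per-SNP P-values. A genome-wide adjusted threshold, as used in the study, would need a further step, for example comparing each observed statistic with the maximum statistic across all SNPs in each permutation.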
Supplementary data are available at Bioinformatics online.