Yoo Wonsuk, Ference Brian A, Cote Michele L, Schwartz Ann
Biostatistics and Epidemiology Division, University of Tennessee Health Science Center, 66 N. Pauline St, Suite 633, Memphis, TN 38163, USA.
Int J Appl Sci Technol. 2012 Aug;2(7):268.
Genome-wide association studies (GWAS) have identified numerous single nucleotide polymorphisms (SNPs) associated with a variety of common human diseases. Because most disease-associated SNPs have weak marginal effects, attention has recently turned to evaluating the combined effect of multiple disease-associated SNPs on disease risk. Several recent multigenic studies suggest that multigenic approaches can be useful in association studies of various diseases, including lung cancer. However, the best methodology for analyzing SNPs in multiple genes remains an open question. In this work, we compare four methods (logistic regression, logic regression, classification trees, and random forests) for identifying important genes as well as gene-gene and gene-environment interactions. To evaluate the performance of the four methods, we report the cross-validation misclassification error and the area under the curve (AUC). We performed a simulation study and applied the methods to data from a large-scale, population-based, case-control study.
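As an illustration of the two evaluation criteria named above, the following minimal sketch computes a k-fold cross-validation misclassification error and an AUC on simulated genotype data. It is not the authors' procedure. Everything in it is an assumption made for illustration: the simulated effect sizes, the unweighted risk-allele count used as a stand-in classifier (none of the four methods compared in the study is implemented here), the helper names, and the reading of "area under the curve" as the area under the ROC curve.

```python
import math
import random


def misclassification_error(y_true, y_pred):
    """Fraction of samples whose predicted label differs from the true label."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve, computed as the Mann-Whitney statistic:
    the share of (case, control) pairs in which the case has the higher
    score, with ties counted as one half."""
    cases = [s for t, s in zip(y_true, scores) if t == 1]
    controls = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def k_fold_splits(n, k, seed=0):
    """Yield (train, test) index lists for k shuffled, disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def simulate(n, n_snps=5, seed=1):
    """Hypothetical data: genotypes coded 0/1/2; only the first two SNPs
    raise disease risk, through an assumed logistic model."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        g = [rng.choice([0, 1, 2]) for _ in range(n_snps)]
        logit = -1.5 + 0.8 * g[0] + 0.8 * g[1]
        p = 1.0 / (1.0 + math.exp(-logit))
        X.append(g)
        y.append(1 if rng.random() < p else 0)
    return X, y


def best_threshold(scores, y):
    """Cut-off on the score that minimizes the error on the given samples."""
    def error_at(c):
        return misclassification_error(y, [1 if s >= c else 0 for s in scores])
    return min(sorted(set(scores)), key=error_at)


def cross_validate(X, y, k=5):
    """Return (cross-validated misclassification error, AUC of the score).
    The cut-off is chosen on the training folds only; the score itself has
    no fitted parameters, so its AUC is computed on all samples."""
    scores = [sum(g) for g in X]  # stand-in classifier: total allele count
    y_pred = [0] * len(y)
    for train, test in k_fold_splits(len(y), k):
        thr = best_threshold([scores[i] for i in train], [y[i] for i in train])
        for i in test:
            y_pred[i] = 1 if scores[i] >= thr else 0
    return misclassification_error(y, y_pred), auc(y, scores)


if __name__ == "__main__":
    X, y = simulate(200)
    cv_error, score_auc = cross_validate(X, y)
    print(f"CV misclassification error: {cv_error:.3f}, AUC: {score_auc:.3f}")
```

In a comparison like the one described, the same fold assignment would be reused for every method, so that differences in error and AUC reflect the methods rather than the split.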