Ziegler Andreas, DeStefano Anita L, König Inke R, Bardel Claire, Brinza Dumitru, Bull Shelley, Cai Zhaohui, Glaser Beate, Jiang Wei, Lee Kristine E, Li Chuang Xing, Li Jing, Li Xin, Majoram Paul, Meng Yan, Nicodemus Kristin K, Platt Alexander, Schwarz Daniel F, Shi Weilang, Shugart Yin Yao, Stassen Hans H, Sun Yan V, Won Sungho, Wang Wenyi, Wahba Grace, Zagaar Usumah A, Zhao Zhenming
Institut für Medizinische Biometrie und Statistik, Universitätsklinikum Schleswig-Holstein, Universität zu Lübeck, Ratzeburger Allee 160, Lübeck, Germany.
Genet Epidemiol. 2007;31 Suppl 1:S51-60. doi: 10.1002/gepi.20280.
Genome-wide association studies using thousands to hundreds of thousands of single nucleotide polymorphism (SNP) markers, and region-wide association studies using a dense panel of SNPs, are already in use to identify disease susceptibility genes and to predict disease risk in individuals. Because these tasks are becoming increasingly important, three different data sets were provided for Genetic Analysis Workshop 15, allowing examination of various novel and existing data mining methods for classification and for identification of disease susceptibility genes and of gene-by-gene or gene-by-environment interactions. The approach most often applied in this presentation group was random forests, chosen for its simplicity, elegance, and robustness. It was used for prediction and, as a first step, for screening for SNPs of interest. The logistic tree with unbiased selection approach appeared to be an interesting alternative for selecting SNPs efficiently. Machine learning methods, specifically ensemble methods, might be useful as pre-screening tools for large-scale association studies because, compared with standard statistical approaches, they can be less prone to overfitting, can require less computer processor time, can easily include pair-wise and higher-order interactions, and can also have a high capability for classification. However, improved implementations that can handle hundreds of thousands of SNPs at a time are required.
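The screening use of random forests described in the abstract can be illustrated with a small sketch. This is not the software used by the workshop contributors. It is a toy, pure-Python forest run on simulated genotypes, and every name in it (`simulate`, `forest_importance`, the causal SNP indices 3 and 17, the risk model) is a hypothetical choice made for illustration. SNPs are ranked by summed Gini impurity decrease, which is one common importance measure among several.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 case-control labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(X, y, rows, features):
    """Return (gain, feature, threshold) for the best split among `features`."""
    parent = gini([y[i] for i in rows])
    best = (0.0, None, None)
    for f in features:
        for t in (0, 1):  # genotypes are coded 0/1/2; split on genotype <= t
            left = [y[i] for i in rows if X[i][f] <= t]
            right = [y[i] for i in rows if X[i][f] > t]
            if not left or not right:
                continue
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            gain = parent - weighted
            if gain > best[0]:
                best = (gain, f, t)
    return best

def grow(X, y, rows, depth, mtry, rng, importance):
    """Grow one tree, adding sample-weighted impurity decreases to `importance`."""
    if depth == 0 or len(rows) < 10:
        return
    features = rng.sample(range(len(X[0])), mtry)  # random SNP subset per node
    gain, f, t = best_split(X, y, rows, features)
    if f is None:
        return
    importance[f] += gain * len(rows)
    grow(X, y, [i for i in rows if X[i][f] <= t], depth - 1, mtry, rng, importance)
    grow(X, y, [i for i in rows if X[i][f] > t], depth - 1, mtry, rng, importance)

def forest_importance(X, y, n_trees=100, depth=3, mtry=None, seed=0):
    """Return normalized Gini importance per SNP from a bagged forest."""
    rng = random.Random(seed)
    p = len(X[0])
    mtry = mtry or max(1, int(p ** 0.5))
    importance = [0.0] * p
    for _ in range(n_trees):
        rows = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
        grow(X, y, rows, depth, mtry, rng, importance)
    total = sum(importance) or 1.0
    return [v / total for v in importance]

def simulate(n=400, p=30, causal=(3, 17), seed=1):
    """Simulate genotypes and disease status; the risk model is an assumption."""
    rng = random.Random(seed)
    X = [[rng.choice((0, 1, 2)) for _ in range(p)] for _ in range(n)]
    y = []
    for row in X:
        risk = 0.1 + 0.35 * (row[causal[0]] > 0) + 0.35 * (row[causal[1]] == 2)
        y.append(1 if rng.random() < risk else 0)
    return X, y

X, y = simulate()
scores = forest_importance(X, y)
top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:5]
print("Top-ranked SNP indices:", top)
```

In a two-stage design, the top-ranked SNPs would then be passed to a conventional association test. With hundreds of thousands of SNPs, a pure-Python loop like this would be far too slow, which is the scalability concern raised at the end of the abstract.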