Bedő Justin, Rawlinson David, Goudey Benjamin, Ong Cheng Soon
NICTA Victoria Research Laboratory, University of Melbourne, Victoria, Australia; Department of Computing and Information Systems, University of Melbourne, Victoria, Australia.
NICTA Victoria Research Laboratory, University of Melbourne, Victoria, Australia; Department of Electrical & Electronic Engineering, University of Melbourne, Victoria, Australia.
PLoS One. 2014 Apr 30;9(4):e93319. doi: 10.1371/journal.pone.0093319. eCollection 2014.
Given the difficulty and effort required to confirm candidate causal SNPs detected in genome-wide association studies (GWAS), there is no practical way to definitively filter false positives. Recent advances in algorithmics and statistics have enabled repeated exhaustive search for bivariate features in a practical amount of time using standard computational resources, allowing us to use cross-validation to evaluate the stability of the selected features. We performed 10 trials of 2-fold cross-validation of exhaustive bivariate analysis on seven Wellcome-Trust Case-Control Consortium GWAS datasets, comparing the traditional χ² test for association, the high-performance GBOOST method and the recently proposed GSS statistic (available at http://bioinformatics.research.nicta.com.au/software/gwis/). We use Spearman's correlation to measure the similarity between the folds of cross-validation. To compare incomplete lists of ranks we propose an extension to Spearman's correlation. The extension allows us to consider a natural threshold for feature selection where the correlation is zero. This is the first reported cross-validation study of exhaustive bivariate GWAS feature selection. We found that stability between ranked lists from different cross-validation folds was higher for GSS in the majority of diseases. A thorough analysis of the correlation between SNP-frequency and univariate χ² score demonstrated that the χ² test for association is highly confounded by main effects: SNPs with high univariate significance replicably dominate the ranked results. We show that removal of the univariately significant SNPs improves χ² replicability but risks filtering pairs involving SNPs with univariate effects. We empirically confirm that the stability of GSS and GBOOST was not affected by removal of univariately significant SNPs.
These results suggest that the GSS and GBOOST tests are successfully targeting bivariate association with phenotype and that GSS is able to reliably detect a larger set of SNP-pairs than GBOOST in the majority of the data we analysed. However, the χ² test for association was confounded by main effects.
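The stability measure described in the abstract, Spearman's correlation between the rankings produced by two cross-validation folds, can be illustrated with a minimal sketch. This is not the authors' implementation, and it does not reproduce their extension for incomplete lists: as a simplifying assumption it only correlates SNP pairs that were scored in both folds. The function names, SNP identifiers and statistic values are all hypothetical.

```python
from math import sqrt
from statistics import mean


def average_ranks(scores):
    """Rank scores from highest (rank 1) to lowest, averaging ranks over ties."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman's correlation: Pearson correlation of the two rank vectors.

    Assumes neither input is constant (otherwise the variance is zero).
    """
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / sqrt(va * vb)


def fold_stability(fold_a, fold_b):
    """Correlate the scores of SNP pairs present in both folds (an assumption,
    not the extension for incomplete lists proposed in the paper)."""
    shared = sorted(set(fold_a) & set(fold_b))
    return spearman([fold_a[p] for p in shared], [fold_b[p] for p in shared])


# Hypothetical test-statistic values for four SNP pairs in two folds.
fold_a = {("rs1", "rs2"): 9.1, ("rs1", "rs3"): 7.4,
          ("rs2", "rs4"): 3.0, ("rs3", "rs4"): 1.2}
fold_b = {("rs1", "rs2"): 8.7, ("rs1", "rs3"): 2.9,
          ("rs2", "rs4"): 6.5, ("rs3", "rs4"): 0.8}

print(fold_stability(fold_a, fold_b))
```

A value near 1 means the two folds rank the shared SNP pairs in nearly the same order, while a value near 0 means the rankings are unrelated. In a repeated 2-fold design such as the one described, this quantity would be computed once per trial and summarised across trials.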