Zhang Xiang, Zou Fei, Wang Wei
Department of Computer Science, University of North Carolina at Chapel Hill.
KDD. 2008:821-829.
Studying the association between quantitative phenotypes (such as height or weight) and single nucleotide polymorphisms (SNPs) is an important problem in biology. To understand the underlying mechanisms of complex phenotypes, it is often necessary to consider joint genetic effects across multiple SNPs. The ANOVA (analysis of variance) test is routinely used in association studies. Important findings from studying gene-gene (SNP-pair) interactions are appearing in the literature. However, the number of SNPs can reach the millions, so evaluating joint effects of SNPs is a challenging task even for SNP-pairs. Moreover, because large numbers of SNPs are correlated, a permutation procedure is preferred over simple Bonferroni correction for properly controlling the family-wise error rate and retaining mapping power, which dramatically increases the computational cost of an association study.

In this paper, we study the problem of finding SNP-pairs that have significant associations with a given quantitative phenotype. We propose an efficient algorithm, FastANOVA, for performing ANOVA tests on SNP-pairs in batch mode, which also supports large permutation tests. We derive an upper bound on the SNP-pair ANOVA test value, which can be expressed as the sum of two terms. The first term is based on the single-SNP ANOVA test. The second term is based on the SNPs alone and is independent of any phenotype permutation. Furthermore, SNP-pairs can be organized into groups, each of which shares a common upper bound. This allows for maximum reuse of intermediate computation, efficient upper bound estimation, and effective SNP-pair pruning. Consequently, FastANOVA only needs to perform the ANOVA test on a small number of candidate SNP-pairs, without the risk of missing any significant ones. Extensive experiments demonstrate that FastANOVA is orders of magnitude faster than a brute-force implementation of ANOVA tests on all SNP-pairs.
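To make the computational burden concrete, the following is a minimal sketch of the brute-force baseline that the abstract compares against, not of FastANOVA itself (the abstract does not give the upper bound in enough detail to reproduce it). It assumes genotypes are coded as small non-negative integers, treats each genotype combination of a SNP-pair as one ANOVA group, and estimates a family-wise significance threshold from the maximum statistic over phenotype permutations. The function names `anova_f`, `max_pair_f`, and `permutation_threshold` are illustrative and do not come from the paper, and the paper's exact test statistic may differ from this setup.

```python
import itertools

import numpy as np


def anova_f(phenotype, groups):
    """One-way ANOVA F-statistic of `phenotype` across integer group labels."""
    labels, inverse = np.unique(groups, return_inverse=True)
    k, n = len(labels), len(phenotype)
    if k < 2 or n <= k:
        return 0.0  # the test is undefined; treat as no evidence
    counts = np.bincount(inverse)
    sums = np.bincount(inverse, weights=phenotype)
    grand_mean = phenotype.mean()
    ss_between = np.sum(counts * (sums / counts - grand_mean) ** 2)
    ss_within = np.sum((phenotype - grand_mean) ** 2) - ss_between
    if ss_within <= 0:
        return float("inf")  # groups explain the phenotype perfectly
    return float((ss_between / (k - 1)) / (ss_within / (n - k)))


def max_pair_f(phenotype, snps):
    """Brute force: test every SNP-pair, return (best F, best pair).

    `snps` is an integer array of shape (num_snps, num_individuals).
    """
    n_levels = int(snps.max()) + 1
    best_f, best_pair = -1.0, None
    for i, j in itertools.combinations(range(snps.shape[0]), 2):
        # Encode each genotype combination of the pair as one group label.
        f = anova_f(phenotype, snps[i] * n_levels + snps[j])
        if f > best_f:
            best_f, best_pair = f, (i, j)
    return best_f, best_pair


def permutation_threshold(phenotype, snps, n_perm=100, alpha=0.05, seed=0):
    """Family-wise threshold from the null distribution of the maximum F.

    Every permutation repeats the full pairwise scan, which is the cost
    that makes the brute-force approach expensive.
    """
    rng = np.random.default_rng(seed)
    null_max = [
        max_pair_f(rng.permutation(phenotype), snps)[0] for _ in range(n_perm)
    ]
    return float(np.quantile(null_max, 1 - alpha))
```

With N SNPs and P permutations, this baseline performs on the order of P·N² ANOVA tests. The abstract's claim is that the pruning bound avoids most of that work: its second term depends only on the SNPs, so it does not have to be recomputed for each permutation.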