Reumann Matthias, Makalic Enes, Goudey Benjamin W, Inouye Michael, Bickerstaffe Adrian, Bui Minh, Park Daniel J, Kapuscinski Miroslaw K, Schmidt Daniel F, Zhou Zeyu, Qian Guoqi, Zobel Justin, Wagner John, Hopper John L
IBM Research Collaboratory for Life Sciences Melbourne, 187 Grattan Street, Carlton, VIC 3010, Australia.
Annu Int Conf IEEE Eng Med Biol Soc. 2012;2012:1258-61. doi: 10.1109/EMBC.2012.6346166.
Most published genome-wide association studies (GWAS) do not examine single nucleotide polymorphism (SNP) interactions because computing p-values for the interaction terms is computationally expensive. Our aim is to use supercomputing resources to apply complex statistical techniques to the world's accumulating GWAS, epidemiology, survival and pathology data to uncover more information about genetic and environmental risk, biology and aetiology. As a proof of principle, we performed the Bayesian Posterior Probability test on a pseudo data set with 500,000 SNPs and 100 samples. We carried out strong scaling simulations on 2 to 4,096 processing cores, doubling the partition size at each step. On two processing cores the run time is 317 h, i.e. almost two weeks, compared with less than 10 minutes on 4,096 processing cores. The speedup factor is 2,020, which is very close to the theoretical value of 2,048. This work demonstrates the feasibility of performing exhaustive higher-order analysis of GWAS data using independence testing for contingency tables. We are now in a position to employ supercomputers with hundreds of thousands of threads for higher-order analysis of GWAS data using complex statistics.
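The scaling figures reported in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' code: the function name `strong_scaling` is hypothetical, and the pair count assumes that "higher order" starts with exhaustive pairwise (second-order) SNP combinations, which the abstract does not state explicitly.

```python
from math import comb


def strong_scaling(base_cores, cores, base_hours, speedup):
    """Return (ideal speedup, parallel efficiency, run time in minutes).

    Hypothetical helper: all inputs are the figures quoted in the abstract.
    """
    ideal = cores / base_cores               # ideal speedup relative to the baseline
    efficiency = speedup / ideal             # fraction of ideal speedup achieved
    minutes = base_hours * 60.0 / speedup    # run time implied by the measured speedup
    return ideal, efficiency, minutes


# Figures from the abstract: 2 -> 4,096 cores, 317 h baseline, speedup of 2,020.
ideal, efficiency, minutes = strong_scaling(2, 4096, 317.0, 2020.0)

# Assumption: exhaustive pairwise testing of 500,000 SNPs.
n_pairs = comb(500_000, 2)

print(f"ideal speedup: {ideal:.0f}")
print(f"parallel efficiency: {efficiency:.1%}")
print(f"implied run time on 4,096 cores: {minutes:.1f} min")
print(f"SNP pairs to test: {n_pairs:,}")
```

Under these inputs the ideal speedup is 2,048, the parallel efficiency is about 98.6%, and the implied run time on 4,096 cores is about 9.4 minutes, which agrees with the "less than 10 minutes" quoted above.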