Jia Peilin, Tian Jian, Zhao Zhongming
Departments of Biomedical Informatics and Psychiatry, Vanderbilt University Medical Centre, Nashville, Tennessee 37232, USA.
Int J Comput Biol Drug Des. 2010;3(4):297-310. doi: 10.1504/IJCBDD.2010.038394. Epub 2011 Feb 4.
Genome-Wide Association Studies (GWAS) have rapidly become a major genetics approach to studying complex diseases. Although many susceptibility variants and genes have been uncovered by single marker analysis, gene set based analysis is emerging as a very promising approach that aims to detect the joint association of a set of genes with disease. In the available gene set based methods, the smallest P value among the Single Nucleotide Polymorphisms (SNPs) in a gene region is often used to represent the gene-level association signal. This approach may strongly bias the association signal towards long genes. In this study, we propose a resampling strategy that randomly generates genomic intervals across the accessible genomic region to estimate the background distribution of P values at the gene level. When compared with the gene-wise P value in the real data, the proportion of random intervals can be used to assess the bias that gene length might introduce and, in turn, to help investigators choose appropriate gene set analysis algorithms for their GWAS datasets. Our method uses only summarised GWAS data and requires no permutation; thus, it is computationally efficient. A computer program is freely available to users.
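The resampling idea described above can be illustrated with a minimal sketch. It is not the authors' program, and several details are assumptions because the abstract does not specify them: the random intervals are given the same length as the gene, the "accessible genomic region" is taken to be the span between the first and last SNP on a single chromosome, and the reported proportion counts random intervals whose smallest P value is at most the gene's. The function names `min_p_in_interval` and `interval_proportion` are hypothetical.

```python
import bisect
import random


def min_p_in_interval(positions, pvalues, start, end):
    """Smallest P value among SNPs with start <= position <= end.

    `positions` must be sorted ascending and aligned with `pvalues`.
    Returns None when no SNP falls in the interval.
    """
    lo = bisect.bisect_left(positions, start)
    hi = bisect.bisect_right(positions, end)
    if lo == hi:
        return None
    return min(pvalues[lo:hi])


def interval_proportion(positions, pvalues, gene_start, gene_end,
                        n_intervals=1000, rng=None):
    """Proportion of random intervals whose smallest P value is <= the gene's.

    Assumption (not stated in the abstract): each random interval has the
    same length as the gene and is placed uniformly between the first and
    last SNP position. Intervals that contain no SNP are skipped.
    """
    rng = rng or random.Random()
    observed = min_p_in_interval(positions, pvalues, gene_start, gene_end)
    if observed is None:
        return None  # the gene region has no SNP, so there is nothing to compare

    length = gene_end - gene_start
    region_start, region_end = positions[0], positions[-1]
    if region_end - region_start < length:
        raise ValueError("gene is longer than the accessible region")

    hits = 0
    used = 0
    attempts = 0
    max_attempts = 100 * n_intervals  # guard against SNP-sparse regions
    while used < n_intervals and attempts < max_attempts:
        attempts += 1
        start = rng.randint(region_start, region_end - length)
        background = min_p_in_interval(positions, pvalues, start, start + length)
        if background is None:
            continue
        used += 1
        if background <= observed:
            hits += 1
    return hits / used if used else None


if __name__ == "__main__":
    # Toy summary data: ten SNPs on one chromosome, one strong signal at 500.
    snp_positions = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
    snp_pvalues = [0.9, 0.9, 0.9, 0.9, 0.001, 0.9, 0.9, 0.9, 0.9, 0.9]
    proportion = interval_proportion(snp_positions, snp_pvalues, 450, 550,
                                     n_intervals=2000, rng=random.Random(0))
    print(f"proportion of random intervals: {proportion:.3f}")
```

Because only SNP positions and P values are needed, the sketch works on summary statistics alone, in line with the abstract's point that no phenotype permutation is required. A longer gene yields longer random intervals, which tend to contain more SNPs and therefore smaller minimum P values; that is the length effect the proportion is meant to expose.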