Department of Computer Science, University of California, Irvine, USA.
BMC Bioinformatics. 2011 Apr 15;12:99. doi: 10.1186/1471-2105-12-99.
BACKGROUND: Recently we have witnessed a surge of interest in using genome-wide association studies (GWAS) to discover the genetic basis of complex diseases. Many genetic variations, mostly in the form of single nucleotide polymorphisms (SNPs), have been identified in a wide spectrum of diseases, including diabetes, cancer, and psychiatric diseases. A common theme arising from these studies is that the genetic variations discovered by GWAS can explain only a small fraction of the genetic risk associated with complex diseases. New strategies and statistical approaches are needed to account for this unexplained risk. One such approach is pathway analysis, which considers the genetic variations underlying a biological pathway jointly, rather than considering each variation separately as in traditional GWAS. A critical challenge in pathway analysis is how to combine evidence of association over multiple SNPs within a gene and multiple genes within a pathway. Most current methods choose the most significant SNP from each gene as a representative, ignoring the joint action of multiple SNPs within a gene. This approach leads to preferential identification of genes with a greater number of SNPs.
RESULTS: We describe a SNP-based pathway enrichment method for GWAS. The method consists of two main steps: 1) for a given pathway, using an adaptive truncated product statistic to identify all representative SNPs (potentially more than one) of each gene, calculating the average number of representative SNPs per gene, and then re-selecting the representative SNPs of the genes in the pathway based on this number; and 2) ranking all selected SNPs by the significance of their statistical association with a trait of interest, and testing whether the set of SNPs from a particular pathway is significantly enriched with high ranks using a weighted Kolmogorov-Smirnov test.
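The two statistics named in these steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the SSEA implementation: the function names, the `weight` parameter, the use of -log10(P) as the ranking score, and the example P-values are all hypothetical. The adaptive part of step 1 (choosing the truncation point and re-selecting representative SNPs) and the assessment of significance are omitted, and the running-sum form of the weighted Kolmogorov-Smirnov statistic follows the style popularized by gene set enrichment analysis.

```python
import math


def truncated_product_log(p_values, tau):
    """Log of a truncated product statistic.

    Only P-values at or below the truncation point tau contribute,
    so the result is the sum of log(p) over p <= tau (0 if none qualify).
    """
    return sum(math.log(p) for p in p_values if p <= tau)


def weighted_ks_enrichment(scores, in_set, weight=1.0):
    """Running-sum weighted Kolmogorov-Smirnov enrichment score.

    scores: association score per SNP, larger means stronger association
            (assumed here to be -log10 of the P-value).
    in_set: parallel list of booleans, True if the SNP belongs to the pathway.
    weight: exponent applied to the score of each pathway SNP (assumption).
    Returns the signed maximum deviation of the running sum from zero.
    """
    # Rank SNPs from strongest to weakest association.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hit_total = sum(abs(scores[i]) ** weight for i in order if in_set[i])
    n_miss = sum(1 for flag in in_set if not flag)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("need at least one scoring hit and one miss")

    running, best = 0.0, 0.0
    for i in order:
        if in_set[i]:
            # Pathway SNP: step up in proportion to its weighted score.
            running += abs(scores[i]) ** weight / hit_total
        else:
            # Non-pathway SNP: step down by a constant amount.
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical P-values for six SNPs; three of them belong to the pathway.
p_values = [1e-6, 1e-4, 0.02, 0.3, 0.5, 0.8]
in_pathway = [True, True, False, False, True, False]
scores = [-math.log10(p) for p in p_values]

print("log truncated product (tau=0.05):", truncated_product_log(p_values, 0.05))
print("enrichment score:", weighted_ks_enrichment(scores, in_pathway))
```

A score near 1 indicates that the pathway's SNPs are concentrated at the top of the ranking, and a negative score indicates concentration at the bottom. A significance level for the score would still have to be obtained separately, for example by permutation.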
We applied our method to two large, genetically distinct GWAS data sets of schizophrenia, one from a European-American (EA) sample and the other from an African-American (AA) sample. In the EA data set, we found 22 pathways with a nominal P-value less than or equal to 0.001 and a corresponding false discovery rate (FDR) below 5%. In the AA data set, we found 11 pathways at the same nominal P-value and FDR thresholds. Interestingly, 8 of these pathways overlap with those found in the EA sample. We have implemented our method in a Java software package, SNP Set Enrichment Analysis (SSEA), which provides a user-friendly interface and is freely available at http://cbcl.ics.uci.edu/SSEA.
CONCLUSIONS: The SNP-based pathway enrichment method described here offers an alternative approach for analysing GWAS data. By applying it to schizophrenia GWAS data, we show that our method is able to identify statistically significant pathways and, importantly, pathways that can be replicated in large, genetically distinct samples.