Cheng Samantha, Andrew Angeline S, Andrews Peter C, Moore Jason H
Department of Biostatistics and Epidemiology, Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104-6116 USA.
Department of Epidemiology, Geisel School of Medicine, Dartmouth College, Hanover, NH 03755 USA.
BioData Min. 2016 Dec 12;9:40. doi: 10.1186/s13040-016-0119-z. eCollection 2016.
Bladder cancer is a common disease with a complex etiology that is likely due to many different genetic and environmental factors. The goal of this study was to embrace this complexity using a bioinformatics analysis pipeline that uses machine learning to measure synergistic interactions between single nucleotide polymorphisms (SNPs) in two genome-wide association studies (GWAS) and then assesses their enrichment within functional groups defined by Gene Ontology. The significance of the results was evaluated using permutation testing, and those results that replicated between the two GWAS data sets were reported.
In the first step of our bioinformatics pipeline, we estimated the pairwise synergistic effects of SNPs on bladder cancer risk in both GWAS data sets using the Multifactor Dimensionality Reduction (MDR) machine learning method, which is designed specifically for this purpose. Statistical significance was assessed using a 1000-fold permutation test. Each single SNP was assigned a P-value based on its strongest pairwise association. Each SNP was then mapped to one or more genes using a window of 500 kb upstream and downstream from each gene boundary. This window was chosen to capture as many regulatory variants as possible. Using Exploratory Visual Analysis (EVA), we then carried out a gene set enrichment analysis at the gene level to identify those genes with an overabundance of significant SNPs relative to the size of their mapped regions. Each gene was assigned to a biological functional group defined by Gene Ontology (GO). We next used EVA to evaluate the overabundance of significant genes in biological functional groups. Our study yielded one GO category, carboxy-lyase activity (GO:0016831), that was significant in analyses from both GWAS data sets. Interestingly, only the gamma-glutamyl carboxylase (GGCX) gene from this GO group was significant in both the detection and replication data, highlighting the complexity of the pathway-level effects on risk. The GGCX gene is expressed in the bladder, but has not been previously associated with bladder cancer in univariate GWAS. However, there is some experimental evidence that carboxy-lyase activity might play a role in cancer and that genes in this pathway should be explored as drug targets. This study provides a genetic basis for that observation.
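The three bookkeeping steps described above (a permutation P-value, a per-SNP P-value taken from the strongest pair, and SNP-to-gene mapping with a 500 kb flank) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the MDR or EVA software used in the study: the function names, the toy identifiers (`rsA`, `GENE1`, and so on), the coordinates, and the `+1` convention in the permutation P-value are all hypothetical choices made for the example.

```python
from collections import defaultdict

FLANK_BP = 500_000  # 500 kb on each side of the gene boundary, as in the text


def empirical_p_value(observed, permuted):
    """Permutation P-value: share of permuted scores >= the observed score.

    The +1 terms are a common convention that avoids a P-value of exactly 0;
    whether the study used this convention is an assumption.
    """
    n_as_extreme = sum(1 for score in permuted if score >= observed)
    return (n_as_extreme + 1) / (len(permuted) + 1)


def snp_p_values(pair_p_values):
    """Give each SNP the smallest P-value among the pairs it takes part in."""
    best = {}
    for (snp_a, snp_b), p in pair_p_values.items():
        for snp in (snp_a, snp_b):
            best[snp] = min(p, best.get(snp, 1.0))
    return best


def map_snps_to_genes(snps, genes, flank=FLANK_BP):
    """Map each SNP to every gene whose flanked region contains it.

    snps:  {snp_id: (chromosome, position)}
    genes: {gene_name: (chromosome, start, end)}
    SNPs that fall in no flanked region are left out of the result.
    """
    mapping = defaultdict(list)
    for snp_id, (chrom, pos) in snps.items():
        for gene, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start - flank <= pos <= end + flank:
                mapping[snp_id].append(gene)
    return dict(mapping)


# Toy inputs (hypothetical identifiers and coordinates, not study data).
pair_p = {("rsA", "rsB"): 0.04, ("rsA", "rsC"): 0.002, ("rsB", "rsC"): 0.2}
snps = {"rsA": ("2", 1_400_000), "rsB": ("2", 3_000_000), "rsC": ("3", 1_400_000)}
genes = {"GENE1": ("2", 1_000_000, 1_050_000), "GENE2": ("2", 1_800_000, 1_900_000)}

per_snp = snp_p_values(pair_p)
snp_to_gene = map_snps_to_genes(snps, genes)
```

In the toy data, `rsA` lies within 500 kb of both hypothetical genes and is therefore mapped to both, which illustrates why a wide window can assign one SNP to more than one gene.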
Our machine learning analysis of genetic associations in two GWAS for bladder cancer identified numerous associations with pairs of SNPs. Gene set enrichment analysis found that risk-associated SNPs aggregate in genes and that significant genes aggregate in GO functional groups. This study supports a role for decarboxylase protein complexes in bladder cancer susceptibility. Previous research has implicated decarboxylases in bladder cancer etiology; however, the genes that we found to be significant in the detection and replication data are not known to have a direct influence on bladder cancer, suggesting some novel hypotheses. This study highlights the need for a complex systems approach to the genetic and genomic analysis of common diseases such as cancer.