Dorani Faramarz, Hu Ting, Woods Michael O, Zhai Guangju
Department of Computer Science, Memorial University, St. John's, Newfoundland and Labrador, Canada.
Faculty of Medicine, Memorial University, St. John's, Newfoundland and Labrador, Canada.
PeerJ. 2018 Oct 29;6:e5854. doi: 10.7717/peerj.5854. eCollection 2018.
Colorectal cancer (CRC) has a high incidence rate in both men and women and affects millions of people every year. Genome-wide association studies (GWAS) of CRC have successfully revealed common single-nucleotide polymorphisms (SNPs) associated with CRC risk. However, these SNPs explain only a very limited fraction of the disease heritability. One reason may be the univariable analyses common in GWAS, where genetic variants are examined one at a time. Given the complexity of cancers, non-additive interaction effects among multiple genetic variants could explain part of the missing heritability. In this study, we employed two powerful ensemble learning algorithms, random forests and gradient boosting machine (GBM), to search for SNPs that contribute to disease risk through non-additive gene-gene interactions. We found 44 possible susceptibility SNPs that were ranked among the most significant by both algorithms. Of those 44 SNPs, 29 are in coding regions. The 29 corresponding genes include four that have previously been associated with CRC and two that are potentially related to CRC because they have known associations with other types of cancer. We performed pairwise and three-way interaction analyses on the 44 SNPs using information-theoretic techniques and found 17 pairwise (p < 0.02) and 16 three-way (p ≤ 0.001) interactions among them. Moreover, functional enrichment analysis suggested 16 functional terms or biological pathways that may help us better understand the etiology of the disease.
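The abstract does not say which information-theoretic measure was used for the pairwise analysis. One common choice for detecting non-additive SNP-SNP effects is the information gain IG(A;B;C) = I(A,B;C) - I(A;C) - I(B;C), where A and B are genotypes and C is case/control status. The sketch below assumes that measure and a simple label-permutation test for significance. All function names, the toy data, and the permutation count are illustrative assumptions, not the authors' implementation.

```python
import math
import random
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of hashable values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def pairwise_information_gain(snp_a, snp_b, phenotype):
    """IG(A;B;C) = I(A,B;C) - I(A;C) - I(B;C).

    Positive values indicate that the two SNPs jointly carry information
    about the phenotype beyond the sum of their individual (main) effects.
    """
    joint = list(zip(snp_a, snp_b))
    return (mutual_information(joint, phenotype)
            - mutual_information(snp_a, phenotype)
            - mutual_information(snp_b, phenotype))


def permutation_p_value(snp_a, snp_b, phenotype, n_perm=200, seed=0):
    """Estimate a p-value by shuffling phenotype labels (assumed test)."""
    rng = random.Random(seed)
    observed = pairwise_information_gain(snp_a, snp_b, phenotype)
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pairwise_information_gain(snp_a, snp_b, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): binary genotypes with a purely epistatic,
# XOR-like effect, so neither SNP is informative on its own.
snp_a = [0, 0, 1, 1] * 50
snp_b = [0, 1, 0, 1] * 50
phenotype = [a ^ b for a, b in zip(snp_a, snp_b)]

ig = pairwise_information_gain(snp_a, snp_b, phenotype)
p_value = permutation_p_value(snp_a, snp_b, phenotype)
print(ig, p_value)
```

In this toy case each SNP has zero mutual information with the phenotype, yet the pair has an information gain of 1 bit. A one-variant-at-a-time analysis would miss this kind of signal. A three-way analysis would extend the same idea to SNP triples and is not shown here.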