Computational Biosciences Program, School of Mathematics and Statistical Sciences, Arizona State University, 1711 South Rural Road, Tempe, Arizona 85287-1804, USA.
BMC Med Genet. 2012 Jan 25;13:7. doi: 10.1186/1471-2350-13-7.
The discovery of genetic associations is important for understanding human illness and deriving disease pathways. Identifying multiple interacting genetic mutations associated with disease remains a challenge in studying the etiology of complex diseases. Although genome-wide association studies (GWAS) have recently confirmed new single nucleotide polymorphisms (SNPs) in genes implicated in immune response, cholesterol/lipid metabolism, and cell membrane processes as associated with late-onset Alzheimer's disease (LOAD), a portion of AD heritability remains unexplained. We use data mining methods to search for other genetic variants that may influence LOAD risk.
Two different approaches were devised to select SNPs associated with LOAD in a publicly available GWAS data set consisting of three cohorts. In both approaches, single-locus analysis (logistic regression) was used to filter the data with a p-value threshold less conservative than the Bonferroni threshold; the resulting subset of SNPs was then used in multi-locus analysis (random forest (RF)). In the second approach, we took prior biological knowledge into account and performed sample stratification and linkage disequilibrium (LD) analysis, in addition to logistic regression analysis, to preselect the loci used as input to the RF classifier construction step.
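The filtering step can be illustrated with a minimal sketch, assuming the single-locus p-values are already available. The p-values, the number of tests, and the relaxed cutoff below are hypothetical placeholders rather than values from the study, and the helper names `bonferroni_threshold` and `filter_snps` are introduced here for illustration only.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def filter_snps(p_values, threshold):
    """Return SNP ids with p-value below threshold, ordered by increasing p-value."""
    kept = [snp for snp, p in p_values.items() if p < threshold]
    return sorted(kept, key=lambda snp: p_values[snp])


# Synthetic p-values for illustration only (NOT results from the study).
toy_p_values = {
    "snp_a": 2e-9,
    "snp_b": 4e-5,
    "snp_c": 8e-4,
    "snp_d": 0.03,
    "snp_e": 0.41,
}

n_tests = 500_000                             # hypothetical number of SNPs tested
strict = bonferroni_threshold(0.05, n_tests)  # about 1e-7
relaxed = 1e-3                                # hypothetical, less conservative cutoff

strict_hits = filter_snps(toy_p_values, strict)    # candidates under Bonferroni
relaxed_hits = filter_snps(toy_p_values, relaxed)  # larger set passed to multi-locus analysis
```

With these toy numbers, the Bonferroni cutoff keeps a single SNP while the relaxed cutoff keeps three, which shows why a less conservative filter leaves more candidates for the multi-locus step.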
The first approach yielded 199 SNPs, mostly associated with genes in calcium signaling, cell adhesion, endocytosis, immune response, and synaptic function. Together with APOE and GAB2 SNPs, these SNPs formed a predictive subset for LOAD status with an average error of 9.8% using 10-fold cross-validation (CV) in RF modeling. The second approach identified nineteen variants in LD with ST5, TRPC1, ATG10, ANO3, NDUFA12, and NISCH, genes linked directly or indirectly with neurobiology. These variants were part of a model that included APOE and GAB2 SNPs to predict LOAD risk, which produced a 10-fold CV average error of 17.5% in the classification modeling.
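How a 10-fold CV average error is computed can be sketched as follows. This is a generic illustration under stated assumptions, not the authors' pipeline: a majority-class predictor stands in for the RF classifier, the labels are synthetic, and the function names are hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def majority_label(labels):
    """Placeholder 'model': the most frequent training label (ties go to the smaller label)."""
    return max(sorted(set(labels)), key=labels.count)


def cross_validated_error(labels, k=10, seed=0):
    """Average misclassification rate over k held-out folds."""
    fold_errors = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held_out_set = set(held_out)
        # Train only on samples outside the held-out fold.
        train_labels = [y for i, y in enumerate(labels) if i not in held_out_set]
        predicted = majority_label(train_labels)
        wrong = sum(1 for i in held_out if labels[i] != predicted)
        fold_errors.append(wrong / len(held_out))
    return sum(fold_errors) / k


# Synthetic case/control labels for illustration only (1 = case, 0 = control).
toy_labels = [1] * 70 + [0] * 30
average_error = cross_validated_error(toy_labels, k=10)
```

In a real analysis the placeholder predictor would be replaced by a classifier trained on the genotypes of the training folds; the fold construction and the averaging of per-fold errors stay the same.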
With the two proposed approaches, we identified a large subset of SNPs in genes mostly clustered around specific pathways/functions and a smaller set of SNPs, within or in proximity to five genes not previously reported, that may be relevant for the prediction/understanding of AD.