Department of Bio and Brain Engineering, KAIST, Daejeon 305-701, South Korea.
BMC Med Inform Decis Mak. 2013;13 Suppl 1(Suppl 1):S3. doi: 10.1186/1472-6947-13-S1-S3. Epub 2013 Apr 5.
Because individual markers from a genome-wide association study (GWAS) have low statistical power, detecting causal single nucleotide polymorphisms (SNPs) for complex diseases is a challenge. SNP combinations have been suggested as a way to compensate for this low power, but evaluating SNP combinations across a GWAS dataset is computationally expensive.
We aim to detect type 2 diabetes (T2D) causal SNP combinations from a GWAS dataset with optimal filtration and to discover the biological meaning of the detected SNP combinations. Optimal filtration can enhance the statistical power of SNP combinations: the filtration criteria are chosen by comparing the error rates of SNP combinations obtained under various Bonferroni thresholds and p-value range-based thresholds, each combined with linkage disequilibrium (LD) pruning. T2D causal SNP combinations are then selected from the resulting optimal SNP dataset using random forests with variable selection. T2D causal SNP combinations and genome-wide SNPs are mapped into functional modules using expanded gene set enrichment analysis (GSEA), which considers pathway, transcription factor (TF)-target, miRNA-target, gene ontology, and protein complex functional modules. Prediction error rates are also measured for SNP sets from functional module-based filtration, which selects SNPs within functional modules from genome-wide SNPs based on expanded GSEA.
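The filtration step described above (a Bonferroni-style p-value cutoff followed by LD pruning) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `alpha` and `r2_max` values, the greedy pruning order, the use of squared genotype correlation as the LD measure, and the toy SNP identifiers and genotypes are all assumptions made for illustration only.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance cutoff: family-wise alpha divided by the number of tests."""
    return alpha / n_tests


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 coding).

    Used here as a simple stand-in for pairwise LD (an assumption).
    """
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no correlation information
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def filter_snps(pvalues, genotypes, alpha=0.05, r2_max=0.8):
    """Keep SNPs passing the Bonferroni cutoff, then greedily prune by LD.

    pvalues:   dict mapping SNP id -> association p-value
    genotypes: dict mapping SNP id -> list of genotypes, one per sample
    alpha, r2_max: hypothetical defaults, not values from the paper
    """
    cutoff = bonferroni_threshold(alpha, len(pvalues))
    # Most significant SNPs first, so they win ties during pruning.
    passed = sorted(
        (snp for snp, p in pvalues.items() if p <= cutoff),
        key=lambda snp: pvalues[snp],
    )
    kept = []
    for snp in passed:
        # Drop a SNP if it is in high LD with any SNP already kept.
        if all(r_squared(genotypes[snp], genotypes[k]) <= r2_max for k in kept):
            kept.append(snp)
    return kept


# Toy data (hypothetical SNP ids and genotypes, not from WTCCC).
toy_pvalues = {"rs_a": 1e-6, "rs_b": 5e-5, "rs_c": 0.2, "rs_d": 1e-4}
toy_genotypes = {
    "rs_a": [0, 1, 2, 0, 1, 2],
    "rs_b": [0, 1, 2, 0, 1, 2],  # identical to rs_a, so in perfect LD with it
    "rs_c": [1, 1, 0, 2, 1, 0],
    "rs_d": [2, 0, 1, 1, 0, 2],
}
print(filter_snps(toy_pvalues, toy_genotypes))  # ['rs_a', 'rs_d']
```

In this toy run, `rs_c` fails the p-value cutoff and `rs_b` is pruned because it duplicates `rs_a`. Repeating the procedure over several cutoffs and comparing the downstream classifier error rates would correspond to the threshold comparison described in the abstract.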
A T2D causal SNP combination containing 101 SNPs is selected from the Wellcome Trust Case Control Consortium (WTCCC) GWAS dataset using optimal filtration criteria, with an error rate of 10.25%. Matching the 101 SNPs with known T2D genes and functional modules reveals the relationships between T2D and SNP combinations. The prediction error rates of SNP sets from functional module-based filtration do not differ significantly from those of randomly selected SNP sets or of T2D causal SNP combinations from optimal filtration.
We propose a method for detecting complex disease causal SNP combinations from an optimal SNP dataset by using random forests with variable selection. Mapping the biological meanings of detected SNP combinations can help uncover complex disease mechanisms.