Salehe Bajuna Rashid, Jones Chris Ian, Di Fatta Giuseppe, McGuffin Liam James
School of Biological Sciences, University of Reading, Reading, United Kingdom.
Department of Computer Science, University of Reading, Reading, United Kingdom.
PLoS One. 2017 Apr 25;12(4):e0175957. doi: 10.1371/journal.pone.0175957. eCollection 2017.
Advances in omics technologies have led to the discovery of genetic markers, or single nucleotide polymorphisms (SNPs), that are associated with particular diseases or complex traits. Although approaches for analysing associations of SNPs with disease have improved significantly, faster and further optimised techniques are needed to keep up with the rate of SNP discovery, which has exacerbated the 'missing heritability' problem. Here, we have devised a novel, integrated, heuristic-based, hybrid analytical computational pipeline for rapidly detecting novel or key genetic variants that are associated with diseases or complex traits. Our pipeline is particularly useful in genetic association studies where the genotyped SNP data are high-dimensional and the complex trait phenotype involved is continuous. In particular, the pipeline is more efficient for investigating small sets of genotyped SNPs defined in high-dimensional spaces that may be associated with continuous phenotypes than for investigating whole-genome variants. The pipeline, which employs a consensus approach based on the random forest, rapidly identified previously unseen key SNPs that are significantly associated with the platelet response phenotype, which we used as our complex trait case study. Several of these SNPs, such as rs6141803 of COMMD7 and rs41316468 in PKT2B, have associations with cardiovascular diseases (CVDs) that were independently confirmed by other, unrelated studies, suggesting that our pipeline is robust in identifying key genetic variants. Our new pipeline provides an important step towards addressing the problem of 'missing heritability' through enhanced detection of key genetic variants (SNPs) that are associated with continuous complex traits or disease phenotypes.
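The abstract does not state how the consensus over random forest results is formed, so the following is only a minimal sketch of one plausible rule, not the authors' method: keep a SNP if it ranks among the top few by importance in a sufficient fraction of repeated forest runs. The function name `consensus_snps`, the parameters `top_k` and `min_fraction`, and the toy SNP labels and scores are all hypothetical.

```python
from collections import Counter


def consensus_snps(importance_runs, top_k=2, min_fraction=0.6):
    """Return SNPs ranked in the top_k of at least min_fraction of runs.

    importance_runs: list of dicts mapping SNP id -> importance score,
    one dict per (hypothetical) random forest run.
    """
    if not importance_runs:
        return []
    votes = Counter()
    for scores in importance_runs:
        # Rank SNPs from most to least important within this run.
        ranked = sorted(scores, key=scores.get, reverse=True)
        votes.update(ranked[:top_k])
    needed = min_fraction * len(importance_runs)
    # Sort the survivors alphabetically so the result is deterministic.
    return sorted(snp for snp, n in votes.items() if n >= needed)


# Toy importance scores from three hypothetical forest runs (not real data).
runs = [
    {"snp_a": 0.40, "snp_b": 0.35, "snp_c": 0.15, "snp_d": 0.10},
    {"snp_a": 0.38, "snp_c": 0.30, "snp_b": 0.22, "snp_d": 0.10},
    {"snp_b": 0.41, "snp_a": 0.33, "snp_d": 0.16, "snp_c": 0.10},
]
selected = consensus_snps(runs)
print(selected)
```

In practice the per-run scores could come from any random forest regressor fitted to genotype data with a continuous phenotype as the target; voting across runs is one way to reduce the run-to-run variability of importance rankings.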