Manduchi Elisabetta, Orzechowski Patryk R, Ritchie Marylyn D, Moore Jason H
1 Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA.
2 Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA.
BioData Min. 2019 Jul 9;12:14. doi: 10.1186/s13040-019-0201-4. eCollection 2019.
The principal line of investigation in Genome-Wide Association Studies (GWAS) is the identification of main effects, that is, individual Single Nucleotide Polymorphisms (SNPs) that are associated with the trait of interest independently of other factors. A variety of methods have been proposed to this end, mostly statistical in nature and differing in their assumptions and in the type of model employed. Moreover, for a given model, there may be multiple choices for the SNP genotype encoding. As an alternative to statistical methods, machine learning methods are often applicable. Typically, for a given GWAS, a single approach is selected and used to identify potential SNPs of interest. Even when multiple GWAS are combined through meta-analyses within a consortium, each GWAS is typically analyzed with a single approach and the resulting summary statistics are then used in the meta-analyses.
In this work we use as case studies a Type 2 Diabetes (T2D) and a breast cancer GWAS to explore a diversity of applicable approaches spanning different methods and encoding choices. We assess similarity of these approaches based on the derived ranked lists of SNPs and, for each GWAS, we identify a subset of representative approaches that we use as an ensemble to derive a union list of top SNPs. Among these are SNPs which are identified by multiple approaches as well as several SNPs identified by only one or a few of the less frequently used approaches. The latter include SNPs from established loci and SNPs which have other supporting lines of evidence in terms of their potential relevance to the traits.
Not every main effect analysis method is suitable for every GWAS, but for each GWAS there are typically multiple applicable methods and encoding options. We suggest a workflow for a single GWAS, extensible to multiple GWAS from consortia, where representative approaches are selected among a pool of suitable options, to yield a more comprehensive set of SNPs, potentially including SNPs that would typically be missed with the most popular analyses, but that could provide additional valuable insights for follow-up.
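As a rough illustration of the workflow described above, the following Python sketch compares approaches by their ranked SNP lists, keeps a subset of representative approaches, and forms the union of their top SNPs. It is only a sketch under stated assumptions: the abstract does not specify the similarity measure or the selection rule, so the Jaccard overlap of top-k sets, the greedy selection, the threshold, the approach names, and the SNP identifiers are all hypothetical.

```python
def top_k(ranked, k):
    """Return the k highest-ranked SNPs of one approach as a set."""
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard overlap of two sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def select_representatives(ranked_lists, k, threshold):
    """Greedily keep an approach only if its top-k set is less than
    `threshold`-similar to every approach already kept (assumed rule)."""
    reps = []
    for name, ranked in ranked_lists.items():
        candidate = top_k(ranked, k)
        if all(jaccard(candidate, top_k(ranked_lists[r], k)) < threshold
               for r in reps):
            reps.append(name)
    return reps


def union_of_top_snps(ranked_lists, reps, k):
    """Map each SNP in the union of top-k lists to the representative
    approaches that identified it."""
    support = {}
    for name in reps:
        for snp in ranked_lists[name][:k]:
            support.setdefault(snp, []).append(name)
    return support


# Hypothetical ranked lists (best SNP first) from three approaches.
ranked_lists = {
    "logistic_additive": ["rs1", "rs2", "rs3", "rs4", "rs5"],
    "logistic_dominant": ["rs1", "rs2", "rs3", "rs6", "rs4"],
    "random_forest": ["rs7", "rs1", "rs8", "rs2", "rs9"],
}

reps = select_representatives(ranked_lists, k=3, threshold=0.5)
support = union_of_top_snps(ranked_lists, reps, k=3)

for snp, approaches in support.items():
    print(snp, approaches)
```

In this toy input the two logistic encodings share the same top three SNPs, so only one of them is kept as a representative, while the machine learning approach contributes SNPs that the others do not rank highly. The `support` mapping separates SNPs found by several approaches from those found by only one.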