Marczyk Michal, Macioszek Agnieszka, Tobiasz Joanna, Polanska Joanna, Zyla Joanna
Department of Data Science and Engineering, Silesian University of Technology, Gliwice, Poland.
Yale Cancer Center, Yale School of Medicine, New Haven, CT, United States.
Front Genet. 2021 Dec 9;12:767358. doi: 10.3389/fgene.2021.767358. eCollection 2021.
A typical genome-wide association study (GWAS) analyzes millions of single-nucleotide polymorphisms (SNPs), several of which lie in the region of the same gene. To conduct gene set analysis (GSA), information from SNPs needs to be unified at the gene level. A widely used practice is to use only the most relevant SNP per gene; however, other integration methods could be applied here. In addition, the nonrandom association of alleles at two or more loci is often neglected. Here, we tested how different integration methods and linkage disequilibrium (LD) correction affect the performance of several GSA methods. Matched normal and breast cancer samples from The Cancer Genome Atlas database were used to evaluate the performance of six GSA algorithms: Coincident Extreme Ranks in Numerical Observations (CERNO), Gene Set Enrichment Analysis (GSEA), GSEA-SNP, improved GSEA for GWAS (i-GSEA4GWAS), Meta-Analysis Gene-set Enrichment of variaNT Associations (MAGENTA), and Over-Representation Analysis (ORA). The association of SNPs with phenotype was calculated using a modified McNemar's test. Results for SNPs mapped to the same gene were integrated using the Fisher and Stouffer methods and compared with the minimum p-value method. Four common measures were used to quantify the performance of all combinations of methods. GSA results obtained from GWAS data were compared with those obtained from gene expression data. Comparing all evaluation metrics across the different GSA algorithms, integration methods, and LD correction, we highlighted CERNO and MAGENTA with Stouffer integration as the most efficient. Applying LD correction increased prioritization and specificity of enrichment outcomes for all tested algorithms. When the Fisher or Stouffer method was used with LD correction, sensitivity and reproducibility were also better. In specific combinations, using any integration method was beneficial in comparison with the minimum p-value method.
The correlation between GSA results at the genomic and transcriptomic levels was highest when Stouffer integration was combined with LD correction. We thoroughly evaluated the performance of different approaches to GSA in GWAS to help others select the most effective combinations. We showed that LD correction and Stouffer integration can increase the performance of enrichment analysis, and we encourage the use of these techniques.
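To make the gene-level integration step concrete, the three ways of summarizing SNP p-values per gene named in the abstract (minimum p-value, Fisher, and Stouffer) can be sketched as follows. This is a minimal illustration in their textbook, unweighted form. It assumes independent SNPs, so it does not include the LD correction, and it is not the authors' implementation. The function names and the example p-values are hypothetical.

```python
import math
from statistics import NormalDist


def min_p(pvalues):
    """Minimum p-value method: keep only the most significant SNP of a gene."""
    return min(pvalues)


def fisher_combine(pvalues):
    """Fisher's method: X = -2 * sum(ln p_i) follows chi-square with 2k df
    under the null, assuming the k p-values are independent."""
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # this is X / 2
    # Closed-form chi-square survival function for an even number of df (2k):
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


def stouffer_combine(pvalues):
    """Unweighted Stouffer's method: Z = sum(z_i) / sqrt(k),
    with z_i = Phi^{-1}(1 - p_i); again assumes independence."""
    norm = NormalDist()
    z_scores = [norm.inv_cdf(1.0 - p) for p in pvalues]
    z_combined = sum(z_scores) / math.sqrt(len(z_scores))
    return 1.0 - norm.cdf(z_combined)


# Hypothetical SNP-level p-values grouped by gene (values strictly in (0, 1)).
snp_pvalues = {
    "GENE_A": [0.05, 0.05],
    "GENE_B": [0.001, 0.8, 0.6],
}

for gene, pvals in snp_pvalues.items():
    print(gene, min_p(pvals), fisher_combine(pvals), stouffer_combine(pvals))
```

In this sketch, the minimum p-value method ignores all but one SNP, whereas the Fisher and Stouffer methods use every SNP mapped to the gene. That is why correlation between neighboring SNPs (LD) matters more for the latter two and would need to be accounted for in a real analysis.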