Département de Sciences Biologiques, Université de Montréal, Montréal, QC, Canada.
Microb Genom. 2020 Mar;6(3). doi: 10.1099/mgen.0.000337.
Genome-wide association studies (GWASs) have the potential to reveal the genetics of microbial phenotypes such as antibiotic resistance and virulence. Capitalizing on the growing wealth of bacterial sequence data, microbial GWAS methods aim to identify causal genetic variants while ignoring spurious associations. Bacteria reproduce clonally, leading to strong population structure and genome-wide linkage, which makes it challenging to separate true 'hits' (i.e. mutations that cause a phenotype) from non-causal linked mutations. GWAS methods attempt to correct for population structure in different ways, but their performance has not yet been systematically and comprehensively evaluated under a range of evolutionary scenarios. Here, we developed a bacterial GWAS simulator (BacGWASim) to generate bacterial genomes with varying rates of mutation, recombination and other evolutionary parameters, along with a subset of causal mutations underlying a phenotype of interest. We assessed the performance (recall and precision) of three widely used single-locus GWAS approaches (cluster-based, dimensionality-reduction and linear mixed models, implemented in plink, pyseer and gemma) and one relatively new multi-locus model implemented in pyseer, across a range of simulated sample sizes, recombination rates and causal mutation effect sizes. As expected, all methods performed better with larger sample sizes and effect sizes. The performance of the clustering and dimensionality-reduction approaches to correcting for population structure varied considerably with the choice of parameters. Notably, the multi-locus elastic net (lasso) approach was consistently amongst the highest-performing methods, and had the highest power to detect causal variants with both low and high effect sizes. Most methods reached good performance (recall >0.75) in identifying causal mutations of strong effect size [log odds ratio (OR) ≥2] with a sample size of 2000 genomes. However, only elastic nets reached reasonable performance (recall=0.35) in detecting markers with weaker effects (log OR ~1) in smaller samples. Elastic nets also showed superior precision and recall in controlling for genome-wide linkage, relative to single-locus models. However, all methods performed relatively poorly on highly clonal (low-recombining) genomes, suggesting room for improvement in method development. These findings show the potential for multi-locus models to improve bacterial GWAS performance. BacGWASim code and simulated data are publicly available to enable further comparisons and benchmarking of new methods.
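The evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It is not taken from BacGWASim or from any of the benchmarked tools: the function names, the 2x2 count layout and the example numbers are hypothetical, the logarithm is assumed to be the natural log, and a hit is counted as correct only when it matches a simulated causal site exactly (non-causal linked sites may well have been scored differently in the study).

```python
import math


def log_odds_ratio(a, b, c, d):
    """Natural-log odds ratio from a 2x2 table (assumed layout).

    a: cases carrying the variant      b: controls carrying the variant
    c: cases lacking the variant       d: controls lacking the variant
    """
    if min(a, b, c, d) <= 0:
        # No continuity correction is applied in this sketch.
        raise ValueError("all four cell counts must be positive")
    return math.log((a * d) / (b * c))


def precision_recall(called, causal):
    """Precision and recall of called variants against known causal variants.

    Both arguments are iterables of variant identifiers (e.g. positions).
    Only exact matches count as true positives (an assumption of this sketch).
    """
    called, causal = set(called), set(causal)
    if not causal:
        raise ValueError("at least one causal variant is required")
    true_positives = len(called & causal)
    # Convention chosen here: precision is 0.0 when nothing was called.
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(causal)
    return precision, recall


if __name__ == "__main__":
    # Hypothetical counts: 30/10 carriers among cases/controls, 10/30 non-carriers.
    print(log_odds_ratio(30, 10, 10, 30))
    # Hypothetical simulated truth (four causal sites) and three GWAS hits.
    print(precision_recall(called={10, 20, 99}, causal={10, 20, 30, 40}))
```

With these made-up inputs, the variant has an odds ratio of 9 (log OR above the abstract's strong-effect threshold of 2), and two of the three hits are causal while two of the four causal sites are recovered.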