MRC Centre for Global Infectious Disease Analysis, Department of Infectious Disease Epidemiology, Imperial College London, London, United Kingdom
Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, University of Oslo, Oslo, Norway
mBio. 2020 Jul 7;11(4):e01344-20. doi: 10.1128/mBio.01344-20.
Discovery of genetic variants underlying bacterial phenotypes and the prediction of phenotypes such as antibiotic resistance are fundamental tasks in bacterial genomics. Genome-wide association study (GWAS) methods have been applied to study these relations, but the plastic nature of bacterial genomes and the clonal structure of bacterial populations create challenges. We introduce an alignment-free method which finds sets of loci associated with bacterial phenotypes, quantifies the total effect of genetics on the phenotype, and allows accurate phenotype prediction, all within a single computationally scalable joint modeling framework. Genetic variants covering the entire pangenome are compactly represented by extended DNA sequence words known as unitigs, and model fitting is achieved using elastic net penalization, an extension of standard multiple regression. Using an extensive set of state-of-the-art bacterial population genomic data sets, we demonstrate that our approach performs accurate phenotype prediction, comparable to popular machine learning methods, while retaining both interpretability and computational efficiency. The variants selected by our joint modeling approach overlap substantially with those found by previous approaches, which test each genotype-phenotype association separately for each variant and apply a significance threshold.
Being able to identify the genetic variants responsible for specific bacterial phenotypes has been the goal of bacterial genetics since its inception and is fundamental to our current level of understanding of bacteria. This identification has been based primarily on painstaking experimentation, but the availability of large data sets of whole genomes with associated phenotype metadata promises to revolutionize this approach, not least for important clinical phenotypes that are not amenable to laboratory analysis. These models of phenotype-genotype association can in the future be used for rapid prediction of clinically important phenotypes such as antibiotic resistance and virulence by rapid-turnaround or point-of-care tests. However, despite much effort being put into adapting genome-wide association study (GWAS) approaches to cope with bacterium-specific problems, such as strong population structure and horizontal gene exchange, current approaches are not yet optimal. We describe a method that advances the methodology for both association analysis and the generation of portable prediction models.
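To make the joint modeling idea concrete, the following is a minimal sketch of elastic net regression fitted by cyclic coordinate descent on a unitig presence/absence matrix. It is not the authors' implementation. The function names (`soft_threshold`, `fit_elastic_net`), the penalty settings (`lam`, `alpha`), and the simulated toy data are assumptions made purely for illustration. The sketch assumes a continuous phenotype and minimises (1/2n)·||y − b0 − Xb||² + lam·(alpha·||b||₁ + (1 − alpha)/2·||b||₂²), so all unitigs enter one model and the L1 term sets most coefficients exactly to zero.

```python
import random


def soft_threshold(z, gamma):
    """Soft-thresholding operator arising from the L1 part of the penalty."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def fit_elastic_net(X, y, lam=0.1, alpha=0.9, n_sweeps=200):
    """Elastic net linear regression via cyclic coordinate descent.

    X: list of rows (isolates) of 0/1 unitig presence values (illustrative).
    y: list of continuous phenotype values.
    Returns (intercept, coefficients).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    intercept = sum(y) / n
    resid = [yi - intercept for yi in y]  # current residuals y - b0 - Xb
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_sweeps):
        # Unpenalised intercept: absorb the mean of the residuals.
        shift = sum(resid) / n
        intercept += shift
        resid = [r - shift for r in resid]
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # unitig absent from every isolate
            # Correlation of column j with the partial residual (beta_j removed).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1.0 - alpha))
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new_bj
    return intercept, beta


# Simulated toy data (not from the paper): 60 isolates, 20 unitigs,
# with only columns 3 and 7 truly affecting the phenotype.
random.seed(0)
n_isolates, n_unitigs = 60, 20
X = [[1.0 if random.random() < 0.5 else 0.0 for _ in range(n_unitigs)]
     for _ in range(n_isolates)]
y = [2.0 * row[3] - 1.5 * row[7] + random.gauss(0.0, 0.3) for row in X]

intercept, beta = fit_elastic_net(X, y)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected unitig columns:", selected)
```

The nonzero entries of `beta` play the role of the selected variants, and a new isolate's phenotype would be predicted as `intercept + sum(b * x for b, x in zip(beta, row))`. A binary phenotype such as resistant versus susceptible would call for a logistic loss instead, and real analyses would choose `lam` by cross-validation and account for population structure, neither of which this sketch attempts.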