Chalka Antonia, Dallman Tim J, Vohra Prerna, Stevens Mark P, Gally David L
The Roslin Institute and R(D)SVS, University of Edinburgh, Edinburgh, UK.
Institute for Risk Assessment Sciences (IRAS), University of Utrecht, Heidelberglaan, Utrecht, Netherlands.
Microb Genom. 2023 Oct;9(10). doi: 10.1099/mgen.0.001116.
Salmonella enterica is a taxonomically diverse pathogen with over 2600 serovars associated with a wide variety of animal hosts including humans, other mammals, birds and reptiles. Some serovars are host-specific or host-restricted and cause disease in distinct host species, while others, such as serovar S. Typhimurium (STm), are generalists and have the potential to colonize a wide variety of species. However, even within generalist serovars such as STm it is becoming clear that pathovariants exist that differ in tropism and virulence. Identifying the genetic factors underlying host specificity is complex, but the availability of thousands of genome sequences and advances in machine learning have made it possible to build specific host prediction models to aid outbreak control and predict the human pathogenic potential of isolates from animals and other reservoirs. We have advanced this area by building host-association prediction models trained on a wide range of genomic features and compared them with predictions based on nearest-neighbour phylogeny. SNPs, protein variants (PVs), antimicrobial resistance (AMR) profiles and intergenic regions (IGRs) were extracted from 3883 high-quality STm assemblies collected from humans, swine, bovine and poultry in the USA, and used to construct Random Forest (RF) machine learning models. An additional 244 recent STm assemblies from farm animals were used as a test set for further validation. The models based on PVs and IGRs had the best performance in terms of predicting the host of origin of isolates and outperformed nearest-neighbour phylogenetic host prediction as well as models based on SNPs or AMR data. However, the models did not yield reliable predictions when tested with isolates that were phylogenetically distinct from the training set. The IGR and PV models were often able to differentiate human isolates in clusters where the majority of isolates were from a single animal source.
Notably, IGRs were the feature with the best performance across multiple models, which may be because IGRs both represent their flanking genes, as PVs do, and capture genomic regulatory variation, such as altered promoter regions. The IGR and PV models predict that ~45% of the human infections with STm in the USA originate from bovine sources, ~40% from poultry and ~14.5% from swine, although sequences of isolates from other sources were not used for training. In summary, the research demonstrates a significant gain in accuracy for models with IGRs and PVs as features compared to SNP-based and core genome phylogeny predictions when applied within the existing population structure. This article contains data hosted by Microreact.
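The source-attribution step described in the abstract, in which each human isolate is assigned a predicted animal host and the predictions are tallied into percentages, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the Random Forest models are replaced here by a much simpler nearest-neighbour rule over Hamming distance, and the toy feature vectors, the `TRAINING` set and all function names are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy data: each isolate is a binary presence/absence vector
# over four made-up genomic features (standing in for IGR or PV variants),
# paired with the animal host it was sampled from.
TRAINING = [
    ((1, 0, 1, 0), "bovine"),
    ((1, 0, 1, 1), "bovine"),
    ((0, 1, 0, 0), "poultry"),
    ((0, 1, 0, 1), "poultry"),
    ((1, 1, 0, 0), "swine"),
]


def hamming(a, b):
    """Count the positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def predict_host(features, training=TRAINING):
    """Return the host label of the closest training isolate.

    Simplified stand-in for a trained classifier. Ties are broken by
    training order, because min() returns the first minimum it finds.
    """
    _, host = min(training, key=lambda item: hamming(features, item[0]))
    return host


def attribute_sources(human_isolates, training=TRAINING):
    """Percentage of human isolates assigned to each animal host."""
    counts = Counter(predict_host(f, training) for f in human_isolates)
    total = sum(counts.values())
    return {host: 100.0 * n / total for host, n in counts.items()}
```

Calling `attribute_sources` on a list of feature vectors from human isolates returns a dictionary mapping each predicted host to its share of the isolates. Because only the hosts present in the training set can ever be predicted, the shares always sum to 100% across those hosts, which mirrors the abstract's caveat that isolates from other sources were not used for training.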