Development and validation of a random forest algorithm for source attribution of animal and human Salmonella Typhimurium and monophasic variants of S. Typhimurium isolates in England and Wales utilising whole genome sequencing data.

Author information

Guzinski Jaromir, Tang Yue, Chattaway Marie Anne, Dallman Timothy J, Petrovska Liljana

Affiliations

Animal and Plant Health Agency, Bacteriology Department, Addlestone, United Kingdom.

Gastrointestinal Bacteria Reference Unit, UK Health Security Agency, London, United Kingdom.

Publication information

Front Microbiol. 2024 Mar 12;14:1254860. doi: 10.3389/fmicb.2023.1254860. eCollection 2023.

Abstract

Source attribution has traditionally involved combining epidemiological data with different pathogen characterisation methods, including 7-gene multi locus sequence typing (MLST) or serotyping; however, these approaches have limited resolution. In contrast, whole genome sequencing data provide an overview of the whole genome that can be used by attribution algorithms. Here, we applied a random forest (RF) algorithm to predict the primary sources of human clinical Salmonella Typhimurium (S. Typhimurium) and monophasic variants (monophasic S. Typhimurium) isolates. To this end, we utilised single nucleotide polymorphism diversity in the core genome MLST alleles obtained from 1,061 laboratory-confirmed human and animal S. Typhimurium and monophasic S. Typhimurium isolates as inputs into an RF model. The algorithm was used for supervised learning to classify 399 animal S. Typhimurium and monophasic S. Typhimurium isolates into one of eight distinct primary source classes comprising common livestock and pet animal species: cattle, pigs, sheep, other mammals (pets: mostly dogs and horses), broilers, layers, turkeys, and game birds (pheasants, quail, and pigeons). When applied to the training set animal isolates, model accuracy was 0.929 and kappa 0.905, whereas for the test set animal isolates, for which the primary source class information was withheld from the model, the accuracy was 0.779 and kappa 0.700. Subsequently, the model was applied to assign 662 human clinical cases to the eight primary source classes. In the dataset, 60/399 (15.0%) of the animal and 141/662 (21.3%) of the human isolates were associated with a known outbreak of S. Typhimurium definitive type (DT) 104. All but two of the 141 DT104 outbreak linked human isolates were correctly attributed by the model to the primary source classes identified as the origin of the DT104 outbreak. A model that was run without the clonal DT104 animal isolates produced largely congruent outputs (training set accuracy 0.989 and kappa 0.985; test set accuracy 0.781 and kappa 0.663). Overall, our results show that RF offers considerable promise as a suitable methodology for epidemiological tracking and source attribution for foodborne pathogens.
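The abstract reports both accuracy and kappa for each model. As a rough illustration of how the two differ, the sketch below computes accuracy and Cohen's kappa for a multi-class classifier from paired true and predicted labels. It is not the authors' code: the function name `accuracy_and_kappa` and the toy labels are hypothetical, and the abstract does not state which software produced the reported values.

```python
from collections import Counter


def accuracy_and_kappa(true_labels, predicted_labels):
    """Return (accuracy, Cohen's kappa) for paired class labels.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(true_labels)

    # Observed agreement: fraction of isolates assigned to their true class.
    observed = sum(t == p for t, p in zip(true_labels, predicted_labels)) / n

    # Agreement expected by chance, from the marginal class frequencies.
    true_counts = Counter(true_labels)
    pred_counts = Counter(predicted_labels)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)

    if expected == 1.0:
        # Only one class present on both sides: kappa is undefined.
        return observed, float("nan")

    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Toy data (made up): ten animal isolates across three source classes.
true_src = ["cattle"] * 4 + ["pigs"] * 4 + ["broilers"] * 2
pred_src = ["cattle", "cattle", "cattle", "pigs",
            "pigs", "pigs", "pigs", "cattle",
            "broilers", "broilers"]

acc, kappa = accuracy_and_kappa(true_src, pred_src)
print(f"accuracy={acc:.3f} kappa={kappa:.4f}")
```

Kappa discounts the agreement that the class frequencies alone would produce by chance, which is why it sits below accuracy in every model reported above (for example 0.779 versus 0.700 on the test set).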