1 Department of Medical Microbiology, University Medical Center Utrecht, Utrecht, The Netherlands.
2 Julius Centre for Health Sciences and Primary Care, University Medical Centre Utrecht, Utrecht, The Netherlands.
Microb Genom. 2018 Nov;4(11). doi: 10.1099/mgen.0.000224. Epub 2018 Nov 1.
Assembly of bacterial short-read whole-genome sequencing data frequently results in hundreds of contigs whose origin (plasmid or chromosome) is unclear. Complete genomes resolved by long-read sequencing can be used to generate and label short-read contigs. We used these labelled contigs to train several popular machine learning methods to classify the origin of contigs from Enterococcus faecium, Klebsiella pneumoniae and Escherichia coli using pentamer frequencies. We selected support-vector machine (SVM) models as the best classifier for all three bacterial species (F1-score E. faecium=0.92, F1-score K. pneumoniae=0.90, F1-score E. coli=0.76); these models outperformed other existing plasmid prediction tools on a benchmarking set of isolates. We demonstrated the scalability of our models by accurately predicting the plasmidome of a large collection of 1644 E. faecium isolates, and illustrated their applicability by predicting the location of antibiotic-resistance genes in all three species. The SVM classifiers are publicly available as an R package and graphical user interface called 'mlplasmids'. We anticipate that this tool may significantly facilitate research on the dissemination of plasmids that encode antibiotic resistance and/or contribute to host adaptation.
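The abstract names pentamer frequencies as the features passed to the classifiers. As a rough illustration only, the sketch below turns a contig into a fixed-length vector of 1024 relative pentamer frequencies in plain Python. It is not the mlplasmids implementation: the function name `pentamer_frequencies`, the use of overlapping windows, the skipping of windows with ambiguous bases, and the normalisation by total window count are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

K = 5  # pentamers, as named in the abstract
ALPHABET = "ACGT"
# All 4**5 = 1024 possible pentamers in a fixed order, so that every
# contig maps to a feature vector of the same length.
KMERS = ["".join(p) for p in product(ALPHABET, repeat=K)]


def pentamer_frequencies(contig: str) -> list[float]:
    """Return relative frequencies of overlapping pentamers in a contig.

    Assumption for this sketch: windows containing characters other
    than A, C, G, T (for example N) are skipped.
    """
    seq = contig.upper()
    counts = Counter(
        seq[i:i + K]
        for i in range(len(seq) - K + 1)
        if set(seq[i:i + K]) <= set(ALPHABET)
    )
    total = sum(counts.values())
    if total == 0:
        # Contig shorter than K, or no unambiguous window at all.
        return [0.0] * len(KMERS)
    return [counts[kmer] / total for kmer in KMERS]


if __name__ == "__main__":
    toy_contig = "ACGTACGTAC"  # hypothetical toy sequence, not real data
    features = pentamer_frequencies(toy_contig)
    print(len(features))  # one value per possible pentamer
```

A vector like this, computed per contig and paired with a plasmid or chromosome label derived from a complete genome, is the kind of input a classifier such as an SVM could be trained on.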