Plasmer: an Accurate and Sensitive Bacterial Plasmid Prediction Tool Based on Machine Learning of Shared k-mers and Genomic Features.

Author information

State Key Laboratory of Microbial Resources, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China.

University of Chinese Academy of Sciences, Beijing, China.

Publication information

Microbiol Spectr. 2023 Jun 15;11(3):e0464522. doi: 10.1128/spectrum.04645-22. Epub 2023 May 16.

Abstract

Identifying plasmids in bacterial genomes is important in many areas, including horizontal gene transfer, antibiotic resistance genes, host-microbe interactions, cloning vectors, and industrial production. Several methods exist to predict plasmid sequences in assembled genomes. However, existing methods have evident shortcomings, such as an imbalance between sensitivity and specificity, dependency on species-specific models, and reduced performance on sequences shorter than 10 kb, which limit their scope of applicability. In this work, we propose Plasmer, a novel plasmid predictor based on machine learning of shared k-mers and genomic features. Unlike existing k-mer-based or genomic-feature-based methods, Plasmer employs the random forest algorithm to make predictions using the percentage of k-mers shared with plasmid and chromosome databases, combined with other genomic features, including alignment E value and replicon distribution scores (RDS). Plasmer can make predictions for multiple species and achieved an average area under the curve (AUC) of 0.996 with an accuracy of 98.4%. Compared to existing methods, tests on sliding sequences, simulated contigs, and assemblies consistently showed that Plasmer has higher accuracy and stable performance across long and short contigs above 500 bp, demonstrating its applicability to fragmented assemblies. Plasmer also shows excellent and balanced sensitivity and specificity (both >0.95 above 500 bp) with the highest F1-score, eliminating the bias toward sensitivity or specificity that is common in existing methods. Plasmer also provides taxonomic classification to help identify the origin of plasmids.

In this study, we proposed a novel plasmid prediction tool named Plasmer. Technically, unlike existing k-mer-based or genomic-feature-based methods, Plasmer is the first tool to combine the advantages of the percentage of shared k-mers and the alignment scores of genomic features. This gives Plasmer (i) an evident improvement in performance compared to other methods, with the best F1-score and accuracy on sliding sequences, simulated contigs, and assemblies; (ii) applicability to contigs above 500 bp with the highest accuracy, enabling plasmid prediction in fragmented short-read assemblies; (iii) excellent and balanced sensitivity and specificity (both >0.95 above 500 bp) with the highest F1-score, eliminating the bias toward sensitivity or specificity that is common in other methods; and (iv) no dependency on species-specific training models. We believe that Plasmer provides a more reliable alternative for plasmid prediction in bacterial genome assemblies.
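
To make the central feature concrete, the following is a minimal sketch of how a "percent of shared k-mers" value could be computed for one contig against a plasmid k-mer set and a chromosome k-mer set. It is not Plasmer's implementation: the function names, the toy sequences, the choice of k = 4, and the decision to count distinct k-mers are all assumptions made for illustration, since the abstract does not specify them. In the approach the abstract describes, values like these would be combined with other genomic features and passed to a random forest classifier; that step is omitted here.

```python
# Illustrative sketch only; names, k, and toy sequences are hypothetical.

def kmers(seq, k):
    """Return the set of distinct k-mers in seq, skipping any with non-ACGT characters."""
    seq = seq.upper()
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    }


def percent_shared(contig, reference_kmers, k):
    """Percent of the contig's distinct k-mers that also occur in reference_kmers."""
    contig_kmers = kmers(contig, k)
    if not contig_kmers:
        return 0.0  # contig shorter than k, or made up of ambiguous bases
    return 100.0 * len(contig_kmers & reference_kmers) / len(contig_kmers)


def shared_kmer_features(contig, plasmid_kmers, chromosome_kmers, k):
    """Return (percent shared with plasmid set, percent shared with chromosome set)."""
    return (
        percent_shared(contig, plasmid_kmers, k),
        percent_shared(contig, chromosome_kmers, k),
    )


K = 4  # assumed value for the toy example; real tools use much larger k
plasmid_db = kmers("ACGTACGTTTGA", K)      # stand-in for a plasmid database
chromosome_db = kmers("GGGCCCAAATTT", K)   # stand-in for a chromosome database

features = shared_kmer_features("ACGTAAAT", plasmid_db, chromosome_db, K)
print(features)
```

A contig whose k-mers are found mostly in the plasmid set and rarely in the chromosome set yields a feature pair skewed toward the first value, which is the kind of signal a downstream classifier can learn from.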

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f758/10269668/1238f7f66349/spectrum.04645-22-f001.jpg
