Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel.
Department of Life Sciences, Ben-Gurion University of the Negev and the National Institute for Biotechnology in the Negev, Marcus Family Campus, Beer-Sheva, Israel.
PLoS Comput Biol. 2020 Apr 3;16(4):e1007781. doi: 10.1371/journal.pcbi.1007781. eCollection 2020 Apr.
Many bacteria contain plasmids, but distinguishing contigs that originate from a plasmid from those that are part of the bacterial genome can be difficult. This is especially true in metagenomic assembly, which yields many contigs of unknown origin. Existing tools for classifying sequences of plasmid origin give less reliable results for shorter sequences, are trained on only a fraction of the known plasmids, and can be difficult to use in practice. We present PlasClass, a new plasmid classifier. It uses a set of standard classifiers, each trained for a different sequence length on the most current set of known plasmid sequences. We tested PlasClass sequence classification on held-out data and simulations, as well as on publicly available bacterial isolates, plasmidome samples, and plasmids assembled from metagenomic samples. PlasClass outperforms the state-of-the-art plasmid classification tool on shorter sequences, which constitute the majority of assembly contigs, allowing it to achieve higher F1 scores in classifying sequences from a wide range of datasets. PlasClass also uses significantly less time and memory. PlasClass can be used to easily classify plasmid and bacterial genome sequences in metagenomic or isolate assemblies. It is available under the MIT license at: https://github.com/Shamir-Lab/PlasClass.
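The idea of keeping one classifier per sequence length can be illustrated with a minimal Python sketch. It is not the PlasClass code or its API. The length scales, the choice of k-mer frequencies as features, and the names `kmer_frequencies`, `nearest_length_scale`, and `classify` are all assumptions made for illustration, since the abstract states only that standard classifiers are trained for different sequence lengths.

```python
from collections import Counter
from itertools import product

# Hypothetical length scales (bp); the abstract does not give the real values.
LENGTH_SCALES = [1_000, 10_000, 100_000]


def kmer_frequencies(seq, k=3):
    """Return normalized k-mer frequencies over the ACGT alphabet.

    k-mers containing other characters (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def nearest_length_scale(seq_len, scales=LENGTH_SCALES):
    """Pick the trained length scale closest to the sequence length."""
    return min(scales, key=lambda s: abs(s - seq_len))


def classify(seq, models, k=3):
    """Score a sequence with the model trained for the closest length.

    `models` maps a length scale to any callable that takes a feature
    vector and returns a plasmid score; it stands in for a trained
    classifier and is an assumption of this sketch.
    """
    scale = nearest_length_scale(len(seq))
    return models[scale](kmer_frequencies(seq, k))
```

In this sketch a short contig is routed to the model for the smallest length scale, so each model only ever scores sequences similar in length to those it was trained on. That is one plausible reading of why per-length classifiers would help on short contigs.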