1 Department of Medical Microbiology, Universitair Medisch Centrum Utrecht, Utrecht, The Netherlands.
2 Institute of Microbiology and Infection, University of Birmingham, Birmingham, England, UK.
Microb Genom. 2017 Aug 18;3(10):e000128. doi: 10.1099/mgen.0.000128. eCollection 2017 Oct.
To benchmark algorithms for automated plasmid sequence reconstruction from short-read sequencing data, we selected 42 publicly available complete bacterial genome sequences spanning 12 genera, containing 148 plasmids. We predicted plasmids from short-read data with four programs (PlasmidSPAdes, Recycler, cBar and PlasmidFinder) and compared the outcome to the reference sequences. PlasmidSPAdes reconstructs plasmids based on coverage differences in the assembly graph. It reconstructed most of the reference plasmids (recall=0.82), but approximately a quarter of the predicted plasmid contigs were false positives (precision=0.75). PlasmidSPAdes merged 84 % of the predictions from genomes with multiple plasmids into a single bin. Recycler searches the assembly graph for sub-graphs corresponding to circular sequences and correctly predicted small plasmids, but failed with long plasmids (recall=0.12, precision=0.30). cBar, which applies pentamer frequency analysis to detect plasmid-derived contigs, showed a recall and precision of 0.76 and 0.62, respectively. However, cBar categorizes contigs as plasmid-derived and does not bin the different plasmids. PlasmidFinder, which searches for replicons, had the highest precision (1.0), but was restricted by the contents of its database and the contig length obtained from assembly (recall=0.36). PlasmidSPAdes and Recycler detected putative small plasmids (<10 kbp), which were also predicted as plasmids by cBar, but were absent in the original assembly. This study shows that it is possible to automatically predict small plasmids. Prediction of large plasmids (>50 kbp) containing repeated sequences remains challenging and limits the high-throughput analysis of plasmids from short-read whole-genome sequencing data.
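The recall and precision values quoted above can be illustrated with a minimal sketch. It assumes each tool's output and the reference are reduced to sets of contig identifiers; the function name, the contig names and the toy data are hypothetical, and the study's own scoring procedure may differ (for example, it may count base pairs rather than contigs).

```python
def precision_recall(predicted, reference):
    """Return (precision, recall) for two collections of contig identifiers.

    precision = true positives / all predicted plasmid contigs
    recall    = true positives / all reference plasmid contigs
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical toy data, not taken from the study.
reference_plasmid_contigs = {"c1", "c2", "c3", "c4"}
predicted_plasmid_contigs = {"c1", "c2", "c3", "c9"}  # "c9" is a false positive

precision, recall = precision_recall(predicted_plasmid_contigs, reference_plasmid_contigs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case one of the four predicted contigs is a false positive, which gives a precision of 0.75: the same situation the abstract describes as "approximately a quarter of the predicted plasmid contigs were false positives".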