Bioinformatics and Systems Biology, Justus Liebig University Giessen, Giessen, Germany.
Institute of Medical Microbiology, Justus Liebig University Giessen, Giessen, Germany.
Microb Genom. 2020 Oct;6(10). doi: 10.1099/mgen.0.000398.
Plasmids are extrachromosomal genetic elements that replicate independently of the chromosome and play a vital role in the environmental adaptation of bacteria. Due to potential mobilization or conjugation capabilities, plasmids are important genetic vehicles for antimicrobial resistance genes and virulence factors with huge and increasing clinical implications. They are therefore subject to large genomic studies within the scientific community worldwide. As a result of rapidly improving next-generation sequencing methods, the quantity of sequenced bacterial genomes is constantly increasing, in turn raising the need for specialized tools to (i) extract plasmid sequences from draft assemblies, (ii) derive their origin and distribution, and (iii) further investigate their genetic repertoire. Recently, several bioinformatic methods and tools have emerged to tackle this issue; however, a combination of high sensitivity and specificity in plasmid sequence identification is rarely achieved in a taxon-independent manner. In addition, many software tools are not appropriate for large high-throughput analyses or cannot be included in existing software pipelines due to their technical design or software implementation. In this study, we investigated differences in the replicon distributions of protein-coding genes on a large scale as a new approach to distinguish plasmid-borne from chromosome-borne contigs. We defined and computed statistical discrimination thresholds for a new metric: the replicon distribution score (RDS), which achieved an accuracy of 96.6 %. The final performance was further improved by the combination of the RDS metric with heuristics exploiting several plasmid-specific higher-level contig characterizations. We implemented this workflow in a new high-throughput taxon-independent bioinformatics software tool called Platon for the recruitment and characterization of plasmid-borne contigs from short-read draft assemblies. 
Compared to PlasFlow, Platon achieved a higher accuracy (97.5 %) and more balanced predictions (F1=82.6 %) when tested on a broad range of bacterial taxa, and it performed as well as or better than the targeted tools PlasmidFinder and PlaScope on sequenced isolates. Platon is available at: http://platon.computational.bio/.
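The abstract reports both accuracy and F1 because the two can diverge when chromosomal contigs greatly outnumber plasmid contigs. The short sketch below illustrates this with the standard definitions of both metrics. The function name and the confusion-matrix counts are hypothetical and chosen only for illustration; they are not taken from the Platon benchmark.

```python
def accuracy_and_f1(tp: int, fp: int, tn: int, fn: int) -> tuple[float, float]:
    """Return (accuracy, F1) for a binary plasmid-vs-chromosome classification.

    tp: plasmid contigs predicted as plasmid
    fp: chromosomal contigs predicted as plasmid
    tn: chromosomal contigs predicted as chromosomal
    fn: plasmid contigs predicted as chromosomal
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # F1 is the harmonic mean of precision and recall, which simplifies to:
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, f1


# Hypothetical, imbalanced example: 1000 contigs, of which only 100 are plasmid-borne.
acc, f1 = accuracy_and_f1(tp=80, fp=10, tn=890, fn=20)
print(f"accuracy = {acc:.3f}, F1 = {f1:.3f}")
```

With these made-up counts the many correctly classified chromosomal contigs keep accuracy high, while F1, which ignores true negatives, reflects the errors made on the minority plasmid class more directly.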