Centre d'Ecologie Fonctionnelle et Evolutive, CNRS, Université de Montpellier, Université Paul Valéry Montpellier 3, Ecole Pratique des Hautes Etudes, Institut de Recherche Pour le Développement, 34000, Montpellier, France.
Genomics, Bioinformatics and Evolution. Departament de Genètica i Microbiologia, Universitat Autònoma de Barcelona, 08193, Cerdanyola del Vallès, Spain.
BMC Bioinformatics. 2021 Jun 26;22(1):349. doi: 10.1186/s12859-021-04270-w.
Plasmids are mobile genetic elements that often carry accessory genes and act as vectors for horizontal transfer between bacterial genomes. Detecting plasmids in large genomic datasets is crucial for analyzing their spread and quantifying their role in bacterial adaptation, particularly in the propagation of antibiotic resistance. Bioinformatics methods have been developed to detect plasmids. However, they suffer from low sensitivity (i.e., most plasmids remain undetected) or low precision (i.e., they identify chromosomes as plasmids), and overall they are poorly suited to identifying plasmids in whole genomes that are not fully assembled (contigs and scaffolds).
We developed PlasForest, a homology-based random forest classifier that identifies bacterial plasmid sequences in partially assembled genomes. Without knowing the taxonomic origin of the samples, PlasForest classifies contigs as plasmids or chromosomes with an F1 score of 0.950. Notably, it detects 77.4% of plasmid contigs below 1 kb with 2.8% false positives, and 99.9% of plasmid contigs over 50 kb with 2.2% false positives.
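For readers less familiar with the metric, the F1 score reported above is the harmonic mean of precision and recall (sensitivity). The following is a minimal sketch of that calculation for a plasmid-versus-chromosome labelling of contigs. The function name `plasmid_f1` and the toy labels are hypothetical and chosen only for illustration. They are not taken from PlasForest or its evaluation data.

```python
def plasmid_f1(true_labels, predicted_labels, positive="plasmid"):
    """Return (precision, recall, f1) for the `positive` class.

    Hypothetical helper for illustration; not part of PlasForest.
    """
    pairs = list(zip(true_labels, predicted_labels))
    # True positives: plasmid contigs correctly called plasmid.
    tp = sum(t == positive and p == positive for t, p in pairs)
    # False positives: chromosome contigs wrongly called plasmid.
    fp = sum(t != positive and p == positive for t, p in pairs)
    # False negatives: plasmid contigs that were missed.
    fn = sum(t == positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example with made-up labels (not real results).
truth = ["plasmid", "plasmid", "plasmid",
         "chromosome", "chromosome", "chromosome"]
calls = ["plasmid", "plasmid", "chromosome",
         "plasmid", "chromosome", "chromosome"]
precision, recall, f1 = plasmid_f1(truth, calls)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because F1 penalizes a shortfall in either precision or recall, a high value such as 0.950 is only reachable when a classifier is both sensitive and precise.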
PlasForest outperforms other currently available tools on genomic datasets by being both sensitive and precise. Its performance on metagenomic assemblies is currently well below that of k-mer-based methods, and we discuss how homology-based approaches could improve plasmid detection in such datasets.