IEEE/ACM Trans Comput Biol Bioinform. 2022 Jan-Feb;19(1):57-67. doi: 10.1109/TCBB.2021.3082915. Epub 2022 Feb 3.
Plasmids are extra-chromosomal genetic material carrying important markers that affect the function and behaviour of microorganisms and support their environmental adaptation. Identifying and recovering plasmid sequences from assemblies is therefore a crucial task in metagenomic analysis. Machine learning approaches have previously been developed to separate chromosomes from plasmids, but existing classification approaches always compromise between precision and recall. Because chromosomes and their plasmids have similar compositions, it is difficult to separate them with high accuracy: high-confidence classifications are accurate but come at a significant cost in recall, and vice versa. More sophisticated approaches are therefore needed to separate plasmids and chromosomes accurately while retaining an acceptable trade-off between precision and recall. We present GraphPlas, a novel approach for plasmid recovery that uses coverage, composition and assembly graph topology. We evaluated GraphPlas on simulated and real short-read assemblies with varying compositions of plasmids and chromosomes. Our experiments show that GraphPlas significantly improves accuracy in detecting plasmid and chromosomal contigs when applied on top of popular state-of-the-art plasmid detection tools. The source code is freely available at https://github.com/anuradhawick/GraphPlas.
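The precision/recall compromise described in the abstract can be illustrated with a minimal sketch. It does not use GraphPlas or any real plasmid classifier: the function name `precision_recall`, the confidence scores and the true labels are all hypothetical toy values, chosen only to show how raising the confidence threshold tends to increase precision while lowering recall.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall for the plasmid class at a confidence threshold.

    scores: hypothetical classifier confidence that each contig is a plasmid.
    labels: True if the contig really is a plasmid, False if chromosomal.
    """
    predicted = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum((not p) and l for p, l in zip(predicted, labels))
    # Conventions for empty denominators: no predictions means no false
    # positives (precision 1.0); no true plasmids means recall 0.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (an assumption for illustration, not taken from the paper).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30]
labels = [True, True, False, True, True, False, False, True]

for threshold in (0.9, 0.5):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, the strict threshold of 0.9 keeps only correct plasmid calls but misses three of the five true plasmids, whereas the looser threshold of 0.5 recovers four of the five at the cost of two chromosomal contigs labelled as plasmids.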