BMC Bioinformatics. 2013;14 Suppl 15(Suppl 15):S16. doi: 10.1186/1471-2105-14-S15-S16. Epub 2013 Oct 15.
Among the challenges that hamper reaping the benefits of genome assembly are unfinished assemblies and the ensuing experimental costs. First, numerous software solutions for de novo genome assembly are available, each with its own advantages and drawbacks, and there are no clear guidelines for choosing among them. Second, these solutions produce draft assemblies that often require a resource-intensive finishing phase.
In this paper we address these two aspects by developing Mix, a tool that mixes two or more draft assemblies without relying on a reference genome, with the goal of reducing contig fragmentation and thus speeding up genome finishing. The proposed algorithm builds an extension graph in which vertices represent extremities of contigs and edges represent existing alignments between these extremities. These alignment edges are used for contig extension. The resulting output assembly corresponds to a set of paths in the extension graph that maximizes the cumulative contig length.
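The extension-graph idea can be illustrated with a minimal sketch. This is not Mix's implementation: it assumes a simplified model in which each contig is collapsed to a single node (rather than one vertex per extremity), the graph is acyclic, and only one best path is sought instead of a set of paths. The contig names, lengths, overlaps, and the functions `build_extension_graph` and `best_path` are all hypothetical.

```python
from functools import lru_cache

# Hypothetical toy data: contig lengths (bp) from two draft assemblies A and B.
contig_len = {"A1": 5000, "A2": 3000, "B1": 4000, "B2": 2500}

# Hypothetical alignments between extremities: (left, right, overlap in bp),
# meaning the end of `left` aligns with the start of `right`.
alignments = [("A1", "B1", 500), ("B1", "A2", 400), ("A1", "B2", 300)]


def build_extension_graph(alignments):
    """Map each contig to the contigs that can extend it, with overlap size."""
    graph = {}
    for left, right, overlap in alignments:
        graph.setdefault(left, []).append((right, overlap))
    return graph


def best_path(graph, contig_len):
    """Return (length, path) of the longest merged path.

    Assumes the graph is acyclic (a simplification made for this sketch).
    """

    @lru_cache(maxsize=None)
    def best_from(node):
        # Baseline: the contig on its own, with no extension.
        best = (contig_len[node], (node,))
        for nxt, overlap in graph.get(node, []):
            tail_len, tail = best_from(nxt)
            # The overlapping region is counted only once.
            candidate = contig_len[node] - overlap + tail_len
            if candidate > best[0]:
                best = (candidate, (node,) + tail)
        return best

    return max((best_from(n) for n in contig_len), key=lambda t: t[0])


graph = build_extension_graph(alignments)
length, path = best_path(graph, contig_len)
print(length, path)  # 11100 ('A1', 'B1', 'A2')
```

In this toy case, chaining A1, B1, and A2 through their overlapping extremities yields a single 11,100 bp sequence, longer than any input contig, which is the sense in which following alignment edges reduces fragmentation.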
We evaluate the performance of Mix on bacterial NGS data from the GAGE-B study and apply it to newly sequenced Mycoplasma genomes. The resulting final assemblies demonstrate a significant improvement in overall assembly quality. In particular, Mix consistently provides better overall quality even when the choice of assemblies is guided solely by standard assembly statistics, as is the case for de novo projects.
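As an illustration of what a standard assembly statistic looks like, the sketch below computes N50, a widely used contiguity measure. The abstract does not say which statistics were used, so treating N50 as one of them is an assumption; the function name `n50` and the example lengths are hypothetical.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half of the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


# Hypothetical contig lengths (bp) before and after merging.
fragmented = [5000, 4000, 3000, 2500]
merged = [11100, 2500]

print(n50(fragmented))  # 4000
print(n50(merged))      # 11100
```

Such statistics need no reference genome, which is why they are the only guide available in de novo projects; they measure contiguity, not correctness.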
Mix is implemented in Python and is available at https://github.com/cbib/MIX; novel data for our Mycoplasma study are available at http://services.cbib.u-bordeaux2.fr/mix/.