Informatics Institute, Heersink School of Medicine, University of Alabama at Birmingham, Birmingham, AL 35294, USA.
Department of Genetics, Heersink School of Medicine, University of Alabama at Birmingham, Birmingham, AL 35294, USA.
BMC Genomics. 2022 May 11;23(Suppl 4):361. doi: 10.1186/s12864-022-08577-7.
Accurate de novo assembly of bacterial genomes is fundamental to understanding the evolution and pathogenesis of new bacterial species. The advent and popularity of third-generation sequencing (TGS) enable assembly of bacterial genomes at unprecedented speed. However, most current TGS assemblers were designed for human or other species that do not have a circular genome. In addition, the repetitive DNA fragments in many bacterial genomes, together with the high error rate of long-read sequencing data, make accurate assembly challenging even though these genomes are relatively small. An assembly method optimized for these issues is therefore urgently needed.
We developed B-assembler, which assembles bacterial genomes from long reads alone or from a combination of short and long reads. B-assembler takes advantage of the structure-resolving power of long reads and, where available, the accuracy of short reads. It first selects and corrects the ultra-long reads to obtain an initial contig. It then collects the reads that overlap the ends of the initial contig. This two-round assembly procedure, together with optimized error correction, yields a high-confidence, circularized genome assembly. Benchmarks on both synthetic and real sequencing data from several bacterial species show that both the long-read-only and hybrid-read modes accurately assemble circular bacterial genomes free of structural errors, with fewer small errors than other assemblers.
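The two selection steps described above (picking ultra-long reads, then collecting reads that overlap the contig ends) can be illustrated with a minimal Python sketch. This is not B-assembler's code: the function names, the `top_fraction` selection rule, and the `min_len` threshold are all hypothetical, and exact string matching stands in for the error-tolerant alignment that real long reads would require.

```python
from typing import Dict, List


def select_ultra_long(reads: Dict[str, str], top_fraction: float = 0.1) -> List[str]:
    """Return IDs of the longest reads.

    Assumption: "ultra-long" is approximated as the top fraction by length;
    the abstract does not state the actual selection rule.
    """
    ranked = sorted(reads, key=lambda rid: len(reads[rid]), reverse=True)
    keep = max(1, int(len(ranked) * top_fraction))
    return ranked[:keep]


def suffix_prefix_overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def end_spanning_reads(contig: str, reads: Dict[str, str], min_len: int) -> List[str]:
    """Return IDs of reads that bridge the contig's end back to its start.

    A read qualifies if its prefix matches the contig's suffix AND its suffix
    matches the contig's prefix, which is the evidence one would expect when
    a linear contig represents a circular genome.
    """
    spanning = []
    for rid, seq in reads.items():
        into_read = suffix_prefix_overlap(contig, seq, min_len)    # contig end -> read start
        out_of_read = suffix_prefix_overlap(seq, contig, min_len)  # read end -> contig start
        if into_read and out_of_read:
            spanning.append(rid)
    return spanning


# Toy data: a 24-base "contig" and two reads, one spanning the junction.
contig = "ACGTTGCAAGGCTTAACCGGTATC"
toy_reads = {
    "bridge": contig[-6:] + contig[:6],  # crosses the end/start junction
    "internal": contig[5:15],            # lies entirely inside the contig
}
spanning_ids = end_spanning_reads(contig, toy_reads, min_len=4)
```

In this toy case only the `bridge` read is reported, because it is the only one whose two ends match the contig's end and start respectively. With real sequencing data, the exact-match comparison would need to be replaced by an aligner that tolerates sequencing errors.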
B-assembler provides a better solution to bacterial genome assembly, which will facilitate downstream bacterial genome analysis.