Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America.
Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, United States of America.
PLoS Comput Biol. 2022 Feb 4;18(2):e1009860. doi: 10.1371/journal.pcbi.1009860. eCollection 2022 Feb.
Third-generation sequencing technologies can generate very long reads with relatively high error rates. The lengths of the reads, which sometimes exceed one million bases, make them invaluable for resolving complex repeats that cannot be assembled using shorter reads. Many high-quality genome assemblies have already been produced, curated, and annotated using the previous generation of sequencing data, and full re-assembly of these genomes with long reads is not always practical or cost-effective. One strategy for upgrading an existing assembly is to generate additional coverage using long-read data and add it to the previously assembled contigs. SAMBA is a tool designed to scaffold and gap-fill existing genome assemblies with additional long-read data, resulting in substantially greater contiguity. SAMBA is the only tool of its kind that also computes and fills in the sequence for all spanned gaps in the scaffolds, yielding much longer contigs. Here we compare SAMBA to several similar tools capable of re-scaffolding assemblies using long-read data, and we show that SAMBA yields better contiguity and introduces fewer errors than competing methods. SAMBA is open-source software distributed at https://github.com/alekseyzimin/masurca.
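The idea of filling a spanned gap can be illustrated with a toy sketch. This is not SAMBA's algorithm: it assumes an error-free long read and exact string matches, whereas real long reads have high error rates and require alignment and consensus. The function name `fill_gap`, the `anchor` parameter, and the sequences are hypothetical and chosen only for illustration.

```python
from typing import Optional


def fill_gap(left: str, right: str, read: str, anchor: int = 6) -> Optional[str]:
    """Join two contigs using a read that spans the gap between them.

    Toy assumption: the read contains the last `anchor` bases of `left`
    and the first `anchor` bases of `right` as exact substrings.
    Returns the merged contig, or None if the read does not span the gap.
    """
    left_anchor = left[-anchor:]
    right_anchor = right[:anchor]

    # Locate the end of the left contig inside the read.
    i = read.find(left_anchor)
    if i == -1:
        return None
    gap_start = i + anchor

    # Locate the start of the right contig downstream of the left anchor.
    j = read.find(right_anchor, gap_start)
    if j == -1:
        return None

    # The read bases between the two anchors are the gap sequence.
    return left + read[gap_start:j] + right


if __name__ == "__main__":
    left_contig = "ACGTACGTTTGACCA"
    right_contig = "GGCATTCAGGATCC"
    spanning_read = "GTTTGACCATATATAGGCATTCAG"
    print(fill_gap(left_contig, right_contig, spanning_read))
```

In this sketch the two contigs become one longer contig because the gap sequence is taken from the read, rather than being left as a run of unknown bases inside a scaffold.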