The SAMBA tool uses long reads to improve the contiguity of genome assemblies.

Affiliations

Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America.

Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, United States of America.

Publication information

PLoS Comput Biol. 2022 Feb 4;18(2):e1009860. doi: 10.1371/journal.pcbi.1009860. eCollection 2022 Feb.

DOI:10.1371/journal.pcbi.1009860

PMID:35120119

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8849508/

Abstract

Third-generation sequencing technologies can generate very long reads with relatively high error rates. The lengths of the reads, which sometimes exceed one million bases, make them invaluable for resolving complex repeats that cannot be assembled using shorter reads. Many high-quality genome assemblies have already been produced, curated, and annotated using the previous generation of sequencing data, and full re-assembly of these genomes with long reads is not always practical or cost-effective. One strategy to upgrade existing assemblies is to generate additional coverage using long-read data, and add that to the previously assembled contigs. SAMBA is a tool that is designed to scaffold and gap-fill existing genome assemblies with additional long-read data, resulting in substantially greater contiguity. SAMBA is the only tool of its kind that also computes and fills in the sequence for all spanned gaps in the scaffolds, yielding much longer contigs. Here we compare SAMBA to several similar tools capable of re-scaffolding assemblies using long-read data, and we show that SAMBA yields better contiguity and introduces fewer errors than competing methods. SAMBA is open-source software that is distributed at https://github.com/alekseyzimin/masurca.
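
To make the gap-filling idea in the abstract concrete, here is a minimal toy sketch in Python. It is not SAMBA's algorithm: the function names `fill_gap` and `n50`, the parameter `k`, and the example sequences are all hypothetical, and the read is assumed to be error-free so that exact string matching works. Real long reads are noisy, so practical tools rely on sequence alignment instead. The sketch only shows how a read that spans the gap between two contigs can supply the missing sequence, and how joining contigs raises N50, a commonly used contiguity statistic.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together cover at least half of the total assembled bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def fill_gap(left, right, read, k=4):
    """Join two contigs with a read that spans the gap between them.

    Toy assumption: the read contains an exact copy of the last k bases
    of `left` and, further along, the first k bases of `right`.
    Returns the merged sequence, or None if the read does not span the gap.
    """
    start = read.find(left[-k:])
    if start == -1:
        return None
    end = read.find(right[:k], start + k)
    if end == -1:
        return None
    gap_sequence = read[start + k:end]  # bases between the two anchors
    return left + gap_sequence + right


# Hypothetical example data: two contigs and one spanning read.
left = "ACGTACGGTCA"
right = "TTGACCGTAGC"
read = "GGTCAGAGAGATTGACC"

merged = fill_gap(left, right, read)
print(merged)
print("N50 before:", n50([len(left), len(right)]))
print("N50 after: ", n50([len(merged)]))
```

In this toy case the two 11-base contigs become a single 28-base contig, so N50 rises from 11 to 28. Filling the gap with actual bases, rather than leaving a run of Ns between scaffolded contigs, is what lengthens the contigs themselves.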
