Efficient de novo assembly of large genomes using compressed data structures.

Affiliation

Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom.

Publication information

Genome Res. 2012 Mar;22(3):549-56. doi: 10.1101/gr.126953.111. Epub 2011 Dec 7.

DOI:10.1101/gr.126953.111

PMID:22156294

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3290790/

Abstract

De novo genome sequence assembly is important both to generate new sequence assemblies for previously uncharacterized genomes and to identify the genome sequence of individuals in a reference-unbiased way. We present memory-efficient data structures and algorithms for assembly using the FM-index derived from the compressed Burrows-Wheeler transform, and a new assembler based on these called SGA (String Graph Assembler). We describe algorithms to error-correct, assemble, and scaffold large sets of sequence data. SGA uses the overlap-based string graph model of assembly, unlike most de novo assemblers that rely on de Bruijn graphs, and is simply parallelizable. We demonstrate the error correction and assembly performance of SGA on 1.2 billion sequence reads from a human genome, which we are able to assemble using 54 GB of memory. The resulting contigs are highly accurate and contiguous, while covering 95% of the reference genome (excluding contigs <200 bp in length). Because of the low memory requirements and parallelization without requiring inter-process communication, SGA provides the first practical assembler to our knowledge for a mammalian-sized genome on a low-end computing cluster.
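As a rough illustration of the FM-index idea mentioned in the abstract, the sketch below builds a Burrows-Wheeler transform for a toy string and counts pattern occurrences by backward search. It is not SGA's implementation: the class name `FMIndex`, the helper `bwt_via_suffix_array`, the naive suffix sort, and the uncompressed occurrence table are all assumptions chosen for readability, whereas a real assembler would use compressed, sampled structures to reach the memory figures reported above.

```python
from collections import Counter


def bwt_via_suffix_array(text):
    """Return the BWT of text + '$' using a naive suffix sort (toy sizes only)."""
    s = text + "$"  # '$' is a sentinel that sorts before A, C, G, T
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT symbol for each suffix is the character preceding it (wrapping around).
    return "".join(s[i - 1] for i in suffix_array)


class FMIndex:
    """Hypothetical, uncompressed FM-index used only to show backward search."""

    def __init__(self, text):
        self.bwt = bwt_via_suffix_array(text)
        counts = Counter(self.bwt)

        # C[c]: number of symbols in the indexed string strictly smaller than c.
        self.C = {}
        total = 0
        for c in sorted(counts):
            self.C[c] = total
            total += counts[c]

        # occ[c][i]: number of occurrences of c in bwt[:i].
        # Stored in full here; practical indexes sample or compress this table.
        self.occ = {c: [0] * (len(self.bwt) + 1) for c in counts}
        for i, ch in enumerate(self.bwt):
            for c in self.occ:
                self.occ[c][i + 1] = self.occ[c][i] + (1 if c == ch else 0)

    def count(self, pattern):
        """Count occurrences of pattern by extending a suffix-array interval leftward."""
        lo, hi = 0, len(self.bwt)
        for c in reversed(pattern):
            if c not in self.C:
                return 0
            lo = self.C[c] + self.occ[c][lo]
            hi = self.C[c] + self.occ[c][hi]
            if lo >= hi:
                return 0
        return hi - lo


if __name__ == "__main__":
    index = FMIndex("GATTACAGATTACA")
    print(index.count("ATTA"))
    print(index.count("TTT"))
```

Each step of `count` narrows the interval of suffixes that start with the pattern processed so far, so the cost depends on the pattern length rather than on the size of the indexed text. Overlap-based assemblers can use the same kind of interval query to find reads whose prefixes match a suffix of another read.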

Similar articles

FSG: Fast String Graph Construction for De Novo Assembly.

J Comput Biol. 2017 Oct;24(10):953-968. doi: 10.1089/cmb.2017.0089. Epub 2017 Jul 17.

Benchmarking of de novo assembly algorithms for Nanopore data reveals optimal performance of OLC approaches.

BMC Genomics. 2016 Aug 22;17 Suppl 7(Suppl 7):507. doi: 10.1186/s12864-016-2895-8.

Efficient construction of an assembly string graph using the FM-index.

Bioinformatics. 2010 Jun 15;26(12):i367-73. doi: 10.1093/bioinformatics/btq217.

FastEtch: A Fast Sketch-Based Assembler for Genomes.

IEEE/ACM Trans Comput Biol Bioinform. 2019 Jul-Aug;16(4):1091-1106. doi: 10.1109/TCBB.2017.2737999. Epub 2017 Sep 11.

BASE: a practical de novo assembler for large genomes using long NGS reads.

BMC Genomics. 2016 Aug 31;17 Suppl 5(Suppl 5):499. doi: 10.1186/s12864-016-2829-5.

Clover: a clustering-oriented de novo assembler for Illumina sequences.

BMC Bioinformatics. 2020 Nov 17;21(1):528. doi: 10.1186/s12859-020-03788-9.

LSG: An External-Memory Tool to Compute String Graphs for Next-Generation Sequencing Data Assembly.

J Comput Biol. 2016 Mar;23(3):137-49. doi: 10.1089/cmb.2015.0172.

Integrating long-range connectivity information into de Bruijn graphs.

Bioinformatics. 2018 Aug 1;34(15):2556-2565. doi: 10.1093/bioinformatics/bty157.

Simplitigs as an efficient and scalable representation of de Bruijn graphs.

Genome Biol. 2021 Apr 6;22(1):96. doi: 10.1186/s13059-021-02297-z.

Cited by

Holocene shifts in marine mammal distributions around Northern Greenland revealed by sedimentary ancient DNA.

Nat Commun. 2025 May 15;16(1):4543. doi: 10.1038/s41467-025-59731-0.

The Role of MSI Testing Methodology and Its Heterogeneity in Predicting Colorectal Cancer Immunotherapy Response.

Int J Mol Sci. 2025 Apr 5;26(7):3420. doi: 10.3390/ijms26073420.

Analytical and Clinical Validation of Solo-Test Driver: A Targeted Amplicon-Based NGS Test-System for FFPE and cfDNA Analysis in Clinical Oncology Setting.

J Clin Lab Anal. 2025 Mar;39(6):e70008. doi: 10.1002/jcla.70008. Epub 2025 Mar 8.

Pervasive Conservation of Intron Number and Other Genetic Elements Revealed by a Chromosome-level Genome Assembly of the Hyper-polymorphic Nematode Caenorhabditis brenneri.

Genome Biol Evol. 2025 Mar 6;17(3). doi: 10.1093/gbe/evaf037.

Navigating Past Oceans: Comparing Metabarcoding and Metagenomics of Marine Ancient Sediment Environmental DNA.

Mol Ecol Resour. 2025 Aug;25(6):e14086. doi: 10.1111/1755-0998.14086. Epub 2025 Feb 20.

Supersaturation mutagenesis reveals adaptive rewiring of essential genes among malaria parasites.

Science. 2025 Feb 7;387(6734):eadq7347. doi: 10.1126/science.adq7347.

De novo transcriptome assembly and discovery of drought-responsive genes in white spruce (Picea glauca).

PLoS One. 2025 Jan 3;20(1):e0316661. doi: 10.1371/journal.pone.0316661. eCollection 2025.

BWT construction and search at the terabase scale.

Bioinformatics. 2024 Nov 28;40(12). doi: 10.1093/bioinformatics/btae717.

Evaluation of blood MSI burden dynamics to trace immune checkpoint inhibitor therapy efficacy through the course of treatment.

Sci Rep. 2024 Oct 8;14(1):23454. doi: 10.1038/s41598-024-73952-1.

soibean: High-Resolution Taxonomic Identification of Ancient Environmental DNA Using Mitochondrial Pangenome Graphs.

Mol Biol Evol. 2024 Oct 4;41(10). doi: 10.1093/molbev/msae203.
