GAGE: A critical evaluation of genome assemblies and assembly algorithms.

Affiliation

McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA.

Publication information

Genome Res. 2012 Mar;22(3):557-67. doi: 10.1101/gr.131383.111. Epub 2012 Jan 6.

DOI: 10.1101/gr.131383.111

PMID: 22147368

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3290791/

Abstract

New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology can generate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one of these projects consist of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. These sequences must then be assembled de novo before most genome analyses can begin. Unfortunately, genome assembly remains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information. In this study, we evaluated several of the leading de novo assembly algorithms on four different short-read data sets, all generated by Illumina sequencers. Our results describe the relative performance of the different assemblers as well as other significant differences in assembly difficulty that appear to be inherent in the genomes themselves. Three overarching conclusions are apparent: first, that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome; second, that the degree of contiguity of an assembly varies enormously among different assemblers and different genomes; and third, that the correctness of an assembly also varies widely and is not well correlated with statistics on contiguity. To enable others to replicate our results, all of our data and methods are freely available, as are all assemblers used in this study.

Similar articles

GAGE: A critical evaluation of genome assemblies and assembly algorithms.

Genome Res. 2012 Mar;22(3):557-67. doi: 10.1101/gr.131383.111. Epub 2012 Jan 6.

GABenchToB: a genome assembly benchmark tuned on bacteria and benchtop sequencers.

PLoS One. 2014 Sep 8;9(9):e107014. doi: 10.1371/journal.pone.0107014. eCollection 2014.

Benchmarking of de novo assembly algorithms for Nanopore data reveals optimal performance of OLC approaches.

BMC Genomics. 2016 Aug 22;17 Suppl 7(Suppl 7):507. doi: 10.1186/s12864-016-2895-8.

High quality 3C de novo assembly and annotation of a multidrug resistant ST-111 Pseudomonas aeruginosa genome: Benchmark of hybrid and non-hybrid assemblers.

Sci Rep. 2020 Jan 29;10(1):1392. doi: 10.1038/s41598-020-58319-6.

Benchmarking Long-Read Assemblers for Genomic Analyses of Bacterial Pathogens Using Oxford Nanopore Sequencing.

Int J Mol Sci. 2020 Dec 1;21(23):9161. doi: 10.3390/ijms21239161.

Evaluating long-read de novo assembly tools for eukaryotic genomes: insights and considerations.

Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giad100. Epub 2023 Nov 24.

High-quality draft assemblies of mammalian genomes from massively parallel sequence data.

Proc Natl Acad Sci U S A. 2011 Jan 25;108(4):1513-8. doi: 10.1073/pnas.1017351108. Epub 2010 Dec 27.

Assessing the benefits of using mate-pairs to resolve repeats in de novo short-read prokaryotic assemblies.

BMC Bioinformatics. 2011 Apr 13;12:95. doi: 10.1186/1471-2105-12-95.

Efficient de novo assembly of large genomes using compressed data structures.

Genome Res. 2012 Mar;22(3):549-56. doi: 10.1101/gr.126953.111. Epub 2011 Dec 7.

Completion of draft bacterial genomes by long-read sequencing of synthetic genomic pools.

BMC Genomics. 2020 Jul 29;21(1):519. doi: 10.1186/s12864-020-06910-6.

Cited by

Chromosome-scale scaffolds of the fungus gnat genome reveal multi-Mb-scale chromosome-folding interactions, centromeric enrichments of retrotransposons, and candidate telomere sequences.

BMC Genomics. 2025 May 5;26(1):443. doi: 10.1186/s12864-025-11573-2.

Applying the Safe-And-Complete Framework to Practical Genome Assembly.

Lebniz Int Proc Inform. 2024;312. doi: 10.4230/LIPIcs.WABI.2024.8. Epub 2024 Aug 26.

Interred mechanisms of resistance and host immune evasion revealed through network-connectivity analysis of complex graph pangenome.

mSystems. 2025 Apr 22;10(4):e0049924. doi: 10.1128/msystems.00499-24. Epub 2025 Mar 6.

Establishing genome sequencing and assembly for non-model and emerging model organisms: a brief guide.

Front Zool. 2025 Apr 17;22(1):7. doi: 10.1186/s12983-025-00561-7.

A Hitchhiker's Guide to long-read genomic analysis.

Genome Res. 2025 Apr 14;35(4):545-558. doi: 10.1101/gr.279975.124.

Statistical Distributions of Genome Assemblies Reveal Random Effects in Ancient Viral DNA Reconstructions.

Viruses. 2025 Jan 30;17(2):195. doi: 10.3390/v17020195.

PSAURON: a tool for assessing protein annotation across a broad range of species.

NAR Genom Bioinform. 2025 Jan 7;7(1):lqae189. doi: 10.1093/nargab/lqae189. eCollection 2025 Mar.

Assessing the de novo assemblers: a metaviromic study of apple and first report of citrus concave gum-associated virus, apple rubbery wood virus 1 and 2 infecting apple in India.

BMC Genomics. 2024 Nov 8;25(1):1057. doi: 10.1186/s12864-024-10968-x.

GCI: a continuity inspector for complete genome assembly.

Bioinformatics. 2024 Nov 1;40(11). doi: 10.1093/bioinformatics/btae633.

Graphasing: phasing diploid genome assembly graphs with single-cell strand sequencing.

Genome Biol. 2024 Oct 10;25(1):265. doi: 10.1186/s13059-024-03409-1.

References

Efficient de novo assembly of large genomes using compressed data structures.

Genome Res. 2012 Mar;22(3):549-56. doi: 10.1101/gr.126953.111. Epub 2011 Dec 7.

Assemblathon 1: a competitive assessment of de novo short read assembly methods.

Genome Res. 2011 Dec;21(12):2224-41. doi: 10.1101/gr.126599.111. Epub 2011 Sep 16.

Bambus 2: scaffolding metagenomes.

Bioinformatics. 2011 Nov 1;27(21):2964-71. doi: 10.1093/bioinformatics/btr520. Epub 2011 Sep 16.

Extensive genomic and transcriptional diversity identified through massively parallel DNA and RNA sequencing of eighteen Korean individuals.

Nat Genet. 2011 Jul 3;43(8):745-52. doi: 10.1038/ng.872.

High-quality draft assemblies of mammalian genomes from massively parallel sequence data.

Proc Natl Acad Sci U S A. 2011 Jan 25;108(4):1513-8. doi: 10.1073/pnas.1017351108. Epub 2010 Dec 27.

Quake: quality-aware detection and correction of sequencing errors.

Genome Biol. 2010;11(11):R116. doi: 10.1186/gb-2010-11-11-r116. Epub 2010 Nov 29.

Multi-platform next-generation sequencing of the domestic turkey (Meleagris gallopavo): genome assembly and analysis.

PLoS Biol. 2010 Sep 7;8(9):e1000475. doi: 10.1371/journal.pbio.1000475.

Assembly of large genomes using second-generation sequencing.

Genome Res. 2010 Sep;20(9):1165-73. doi: 10.1101/gr.101360.109. Epub 2010 May 27.

Detection and correction of false segmental duplications caused by genome mis-assembly.

Genome Biol. 2010;11(3):R28. doi: 10.1186/gb-2010-11-3-r28. Epub 2010 Mar 10.

Complete Khoisan and Bantu genomes from southern Africa.

Nature. 2010 Feb 18;463(7283):943-7. doi: 10.1038/nature08795.
