De novo likelihood-based measures for comparing genome assemblies.

Authors

Ghodsi Mohammadreza, Hill Christopher M, Astrovskaya Irina, Lin Henry, Sommer Dan D, Koren Sergey, Pop Mihai

Affiliation

Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, USA.

Publication information

BMC Res Notes. 2013 Aug 22;6:334. doi: 10.1186/1756-0500-6-334.

DOI:10.1186/1756-0500-6-334

PMID:23965294

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3765854/

Abstract

BACKGROUND

The current revolution in genomics has been made possible by software tools called genome assemblers, which stitch together DNA fragments "read" by sequencing machines into complete or nearly complete genome sequences. Despite decades of research in this field and the development of dozens of genome assemblers, assessing and comparing the quality of assembled genome sequences still relies on the availability of independently determined standards, such as manually curated genome sequences, or independently produced mapping data. These "gold standards" can be expensive to produce and may only cover a small fraction of the genome, which limits their applicability to newly generated genome sequences. Here we introduce a de novo probabilistic measure of assembly quality which allows for an objective comparison of multiple assemblies generated from the same set of reads. We define the quality of a sequence produced by an assembler as the conditional probability of observing the sequenced reads from the assembled sequence. A key property of our metric is that the true genome sequence maximizes the score, unlike other commonly used metrics.
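The definition above can be made concrete with a toy model. The sketch below is not the authors' implementation; it assumes that reads start uniformly at random along the assembly, come from the forward strand only, and contain independent substitution errors at a fixed rate. The names `read_probability` and `log_likelihood` and the `error_rate` parameter are illustrative choices, not taken from the paper.

```python
import math


def read_probability(read, assembly, error_rate=0.01):
    """P(read | assembly) under a toy model (an assumption, not the paper's model):
    uniform start position, forward strand only, independent substitution errors."""
    n_starts = len(assembly) - len(read) + 1
    if n_starts <= 0:
        return 0.0  # the read cannot be placed on a shorter assembly
    total = 0.0
    for start in range(n_starts):
        window = assembly[start:start + len(read)]
        p = 1.0
        for r, a in zip(read, window):
            # A match keeps the base; a mismatch is one of three wrong bases.
            p *= (1.0 - error_rate) if r == a else error_rate / 3.0
        total += p
    # Average over all equally likely start positions.
    return total / n_starts


def log_likelihood(reads, assembly, error_rate=0.01, floor=1e-300):
    """Sum of log P(read | assembly); the floor avoids log(0)."""
    return sum(
        math.log(max(read_probability(r, assembly, error_rate), floor))
        for r in reads
    )


# Toy data: error-free 8-base reads from every position of a made-up genome.
genome = "ACGTTGCAAGGCTTAACCGT"
reads = [genome[i:i + 8] for i in range(len(genome) - 8 + 1)]
truncated = genome[:12]  # a hypothetical incomplete assembly

score_true = log_likelihood(reads, genome)
score_truncated = log_likelihood(reads, truncated)
print(score_true > score_truncated)  # prints True
```

In this example the full sequence scores higher than the truncated one because reads from the missing region can only be explained as several sequencing errors, which the model treats as very unlikely.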

RESULTS

We demonstrate that our de novo score can be computed quickly and accurately in a practical setting even for large datasets, by estimating the score from a relatively small sample of the reads. To demonstrate the benefits of our score, we measure the quality of the assemblies generated in the GAGE and Assemblathon 1 assembly "bake-offs" with our metric. Even without knowledge of the true reference sequence, our de novo metric closely matches the reference-based evaluation metrics used in the studies and outperforms other de novo metrics traditionally used to measure assembly quality (such as N50). Finally, we highlight the application of our score to optimize assembly parameters used in genome assemblers, which enables better assemblies to be produced, even without prior knowledge of the genome being assembled.
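Estimating the score from a sample of reads can be sketched as averaging per-read log-probabilities over a random subset. This is a minimal illustration under assumptions the abstract does not state: the function name `estimate_mean_log_prob`, the `log_prob_fn` callback, and the sampling scheme (simple random sampling without replacement) are all hypothetical.

```python
import random


def estimate_mean_log_prob(reads, log_prob_fn, sample_size, seed=0):
    """Estimate the mean per-read log-probability from a random sample.

    log_prob_fn maps a read to log P(read | assembly); it is supplied by
    the caller, so any probabilistic read model can be plugged in.
    """
    reads = list(reads)
    if not reads:
        raise ValueError("at least one read is required")
    if sample_size >= len(reads):
        sample = reads  # sample covers everything: use all reads
    else:
        rng = random.Random(seed)  # fixed seed for reproducibility
        sample = rng.sample(reads, sample_size)
    return sum(log_prob_fn(r) for r in sample) / len(sample)


# Toy usage with a stand-in scoring function (an assumption for illustration):
# each base contributes -0.5 to the log-probability of a read.
def toy_log_prob(read):
    return -0.5 * len(read)


all_reads = ["ACGT"] * 900 + ["ACGTACGT"] * 100
full_mean = estimate_mean_log_prob(all_reads, toy_log_prob, len(all_reads))
sampled_mean = estimate_mean_log_prob(all_reads, toy_log_prob, 200)
print(full_mean, sampled_mean)
```

Comparing assemblies with such an estimate only makes sense if the same sample of reads (here, the same `seed` and `sample_size`) is scored against every candidate assembly.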

CONCLUSION

Likelihood-based measures, such as ours proposed here, will become the new standard for de novo assembly evaluation.
