Center for Applied Mathematics, Cornell University, Ithaca, NY 14853, USA.
Bioinformatics. 2013 Feb 15;29(4):435-43. doi: 10.1093/bioinformatics/bts723. Epub 2013 Jan 9.
Researchers need general-purpose methods for objectively evaluating the accuracy of single-genome and metagenome assemblies and for automatically detecting any errors they may contain. Current methods do not fully meet this need because they require a reference, consider only one of the many aspects of assembly quality, or lack statistical justification, and none is designed to evaluate metagenome assemblies.
In this article, we present an Assembly Likelihood Evaluation (ALE) framework that overcomes these limitations, systematically evaluating the accuracy of an assembly in a reference-independent manner using rigorous statistical methods. This framework is comprehensive and integrates read quality, mate pair orientation and insert length (for paired-end reads), sequencing coverage, read alignment and k-mer frequency. ALE pinpoints synthetic errors in both single-genome and metagenomic assemblies, including single-base errors, insertions/deletions, genome rearrangements and chimeric assemblies present in metagenomes. At the genome level with real-world data, ALE identifies three large misassemblies in the Spirochaeta smaragdinae finished genome, all of which were independently validated by Pacific Biosciences sequencing. At the single-base level with Illumina data, ALE recovers 215 of 222 (97%) single nucleotide variants in a training set from a GC-rich Rhodobacter sphaeroides genome. Using real Pacific Biosciences data, ALE identifies 12 of 12 synthetic errors in a Lambda Phage genome, surpassing even Pacific Biosciences' own variant caller, EviCons. In summary, the ALE framework provides a comprehensive, reference-independent and statistically rigorous measure of single-genome and metagenome assembly accuracy, which can be used to identify misassemblies or to optimize the assembly process.
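To make the idea of a reference-independent, likelihood-based score concrete, the following toy sketch combines two per-position evidence sources (sequencing depth and insert length) into a single log-likelihood and flags positions that score poorly. It is not ALE's implementation: the abstract does not specify the sub-models, so the Poisson depth model, the normal insert-length model, the function names and the threshold are all assumptions chosen for illustration.

```python
import math


def insert_log_likelihood(insert_len, mean, sd):
    """Log density of an insert length under a normal model (assumed model)."""
    z = (insert_len - mean) / sd
    return -0.5 * z * z - math.log(sd * math.sqrt(2.0 * math.pi))


def depth_log_likelihood(depth, expected_depth):
    """Log probability of an observed depth under a Poisson model (assumed model)."""
    return depth * math.log(expected_depth) - expected_depth - math.lgamma(depth + 1)


def position_scores(depths, insert_lens, expected_depth, insert_mean, insert_sd):
    """Sum the two component log-likelihoods at each assembly position."""
    return [
        depth_log_likelihood(d, expected_depth)
        + insert_log_likelihood(i, insert_mean, insert_sd)
        for d, i in zip(depths, insert_lens)
    ]


def flag_positions(scores, threshold):
    """Return indices whose combined log-likelihood falls below the threshold."""
    return [idx for idx, score in enumerate(scores) if score < threshold]


# Hypothetical data: position 2 has low depth and an implausibly long insert.
depths = [30, 31, 2, 29]
insert_lens = [300, 305, 900, 298]
scores = position_scores(depths, insert_lens, expected_depth=30,
                         insert_mean=300, insert_sd=20)
suspect = flag_positions(scores, threshold=-20.0)
```

Summing log-likelihoods treats the evidence sources as independent, which is a simplification; a full framework of the kind described above would also account for read quality, mate pair orientation, read alignment and k-mer frequency.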
ALE is released as open source software under the UoI/NCSA license at http://www.alescore.org. It is implemented in C and Python.