National Biodefense Analysis and Countermeasures Center, 110 Thomas Johnson Drive, Frederick, MD 21702, USA.
BMC Bioinformatics. 2014 May 3;15:126. doi: 10.1186/1471-2105-15-126.
The continued democratization of DNA sequencing has sparked a new wave of development of genome assembly and assembly validation methods. As individual research labs, rather than large sequencing centers, begin to sequence the majority of new genomes, it is important to establish best practices for genome assembly. However, recent evaluations such as GAGE and the Assemblathon have concluded that there is no single best approach to genome assembly. Instead, it is preferable to generate multiple assemblies and validate them to determine which is most useful for the desired analysis; done by hand, this is a labor-intensive process that is often impractical or infeasible.
To encourage best practices supported by the community, we present iMetAMOS, an automated ensemble assembly pipeline; iMetAMOS encapsulates the process of running, validating, and selecting a single assembly from multiple assemblies. iMetAMOS packages several leading open-source tools into a single binary that automates parameter selection and execution of multiple assemblers, scores the resulting assemblies based on multiple validation metrics, and annotates the assemblies for genes and contaminants. We demonstrate the utility of the ensemble process on 225 previously unassembled Mycobacterium tuberculosis genomes as well as a Rhodobacter sphaeroides benchmark dataset. On these real data, iMetAMOS reliably produces validated assemblies and identifies potential contamination without user intervention. In addition, intelligent parameter selection produces assemblies of R. sphaeroides comparable to or exceeding the quality of those from the GAGE-B evaluation, affecting the relative ranking of some assemblers.
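The selection step described above (score several candidate assemblies on multiple validation metrics, then pick one) can be illustrated with a minimal sketch. This is not iMetAMOS's actual scoring scheme: the three metrics, the rank-sum aggregation, the names `AssemblyMetrics`, `dense_ranks`, `rank_sums`, and `select_assembly`, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssemblyMetrics:
    """Hypothetical validation metrics for one candidate assembly."""
    name: str
    n50: int            # contig N50 in bp; higher is better
    num_contigs: int    # lower is better
    misassemblies: int  # lower is better


def dense_ranks(values, higher_is_better):
    """Rank each value (1 = best); equal values share the same rank."""
    ordered = sorted(set(values), reverse=higher_is_better)
    position = {value: index + 1 for index, value in enumerate(ordered)}
    return [position[value] for value in values]


def rank_sums(candidates):
    """Sum each candidate's per-metric ranks; a lower total is better."""
    per_metric = [
        dense_ranks([c.n50 for c in candidates], higher_is_better=True),
        dense_ranks([c.num_contigs for c in candidates], higher_is_better=False),
        dense_ranks([c.misassemblies for c in candidates], higher_is_better=False),
    ]
    return {
        c.name: sum(ranks[i] for ranks in per_metric)
        for i, c in enumerate(candidates)
    }


def select_assembly(candidates):
    """Return the candidate with the lowest rank sum (ties broken by name)."""
    totals = rank_sums(candidates)
    return min(candidates, key=lambda c: (totals[c.name], c.name))


# Made-up numbers for three hypothetical assemblers; not results from the paper.
CANDIDATES = [
    AssemblyMetrics("assembler_A", n50=120_000, num_contigs=85, misassemblies=4),
    AssemblyMetrics("assembler_B", n50=95_000, num_contigs=60, misassemblies=1),
    AssemblyMetrics("assembler_C", n50=150_000, num_contigs=110, misassemblies=9),
]

if __name__ == "__main__":
    print(rank_sums(CANDIDATES))
    print(select_assembly(CANDIDATES).name)
```

In this toy example the assembly with the largest N50 is not the one selected, because it ranks last on contig count and misassemblies. That is the general motivation for scoring on several metrics rather than on contiguity alone.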
Ensemble assembly with iMetAMOS provides users with multiple, validated assemblies for each genome. Although computationally limited to small or mid-sized genomes, this approach is the most effective and reproducible means for generating high-quality assemblies and enables users to select an assembly best tailored to their specific needs.