1Pathogen Informatics, Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, Cambridgeshire, UK.
2Biochemical Development, Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, Cambridgeshire, UK.
Microb Genom. 2016 Aug 25;2(8):e000083. doi: 10.1099/mgen.0.000083. eCollection 2016 Aug.
The rapidly falling cost of bacterial genome sequencing has led to its routine use in large-scale microbial analysis. Though mapping approaches can be used to find differences relative to a reference, many bacteria are subject to constant evolutionary pressures resulting in events such as the loss and gain of mobile genetic elements, horizontal gene transfer through recombination, and genomic rearrangements. De novo assembly is the reconstruction of the underlying genome sequence, an essential step to understanding bacterial genome diversity. Here we present a high-throughput bacterial assembly and improvement pipeline that has been used to generate nearly 20 000 annotated draft genome assemblies in public databases. We demonstrate its performance on a public data set of 9404 genomes. We find all the genes used in multi-locus sequence typing schemes present in 99.6 % of assembled genomes. When tested on low-, neutral- and high-GC organisms, more than 94 % of genes were present and completely intact. The pipeline has proven to be scalable and robust across a wide variety of datasets without requiring human intervention. All of the software is available on GitHub under the GNU GPL open source license.