Bratcher Holly B, Corton Craig, Jolley Keith A, Parkhill Julian, Maiden Martin C J
Department of Zoology, University of Oxford, Oxford, UK.
BMC Genomics. 2014 Dec 18;15(1):1138. doi: 10.1186/1471-2164-15-1138.
Highly parallel, 'second generation' sequencing technologies have rapidly expanded the number of bacterial whole genome sequences available for study, permitting the emergence of the discipline of population genomics. Most of these data are publicly available as unassembled short-read sequence files that require extensive processing before they can be used for analysis. The provision of data in a uniform format, which can be easily assessed for quality, linked to provenance and phenotype, and used for analysis, is therefore necessary.
The performance of de novo short-read assembly followed by automatic annotation using the pubMLST.org Neisseria database was evaluated for 108 diverse, representative, and well-characterised Neisseria meningitidis isolates. High-quality sequences were obtained for >99% of known meningococcal genes among the de novo assembled genomes and four resequenced genomes, and fewer than 1% of reassembled genes had sequence discrepancies or misassembled sequences. A core genome of 1600 loci, present in at least 95% of the population, was determined using the Genome Comparator tool. Core genome comparisons and ribosomal protein gene analysis yielded genealogical relationships that were compatible with, but at a higher resolution than, those identified by multilocus sequence typing, and revealed a genomic structure for a number of previously described phenotypes. This unified system for cataloguing Neisseria genetic variation in the genome was implemented and used for multiple analyses, and the data are publicly available in the PubMLST Neisseria database.
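The 95% core genome criterion and gene-by-gene comparison described above can be illustrated with a minimal sketch. It is not the Genome Comparator implementation. It assumes a hypothetical data layout in which each isolate maps locus names to allele numbers, with `None` marking a locus that was not found or not designated. The function names, locus names, and isolate names are all invented for illustration.

```python
from collections import Counter


def core_loci(allele_calls, threshold=0.95):
    """Return the loci that have an allele call in at least `threshold` of isolates.

    `allele_calls` maps isolate name -> {locus name: allele number or None}.
    This layout is an assumption made for the sketch.
    """
    n_isolates = len(allele_calls)
    if n_isolates == 0:
        return set()
    # Count, per locus, how many isolates carry a designated allele.
    counts = Counter(
        locus
        for calls in allele_calls.values()
        for locus, allele in calls.items()
        if allele is not None
    )
    return {locus for locus, c in counts.items() if c / n_isolates >= threshold}


def allele_distance(calls_a, calls_b, loci):
    """Count loci at which both isolates have a call and the alleles differ."""
    return sum(
        1
        for locus in loci
        if calls_a.get(locus) is not None
        and calls_b.get(locus) is not None
        and calls_a[locus] != calls_b[locus]
    )


# Hypothetical toy data: 20 isolates, 3 loci.
isolates = {
    f"isolate_{i}": {"locus_A": 1, "locus_B": 2, "locus_C": 3} for i in range(20)
}
isolates["isolate_0"]["locus_B"] = None  # locus_B called in 19 of 20 isolates (95%)
isolates["isolate_0"]["locus_C"] = None
isolates["isolate_1"]["locus_C"] = None  # locus_C called in 18 of 20 isolates (90%)
isolates["isolate_2"]["locus_A"] = 7     # a variant allele at locus_A

core = core_loci(isolates)
print(sorted(core))
print(allele_distance(isolates["isolate_2"], isolates["isolate_3"], core))
```

In this toy data set, `locus_C` falls below the threshold and is excluded from the core, while the pairwise distance counts only differing alleles at core loci. Ignoring loci that are missing in either isolate is one possible convention, chosen here for simplicity.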
De novo assembly, combined with automated gene-by-gene annotation, generates high-quality draft genomes in which the majority of protein-encoding genes are present with high accuracy. The approach catalogues diversity efficiently, permits analyses of a single genome or comparisons of multiple genomes, and is a practical way to interpret WGS data for large bacterial population samples. The method generates novel insights into the biology of the meningococcus and improves our understanding of the whole population structure, not just disease-causing lineages.