Parasites and Microbes, The Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK.
Rollins School of Public Health, Emory University, Atlanta, GA, USA.
Microb Genom. 2024 Aug;10(8). doi: 10.1099/mgen.0.001278.
Defining the population structure of a pathogen is a key part of epidemiology, as genomically related isolates are likely to share key clinical features such as antimicrobial resistance profiles and invasiveness. Multiple different methods are currently used to cluster together closely related genomes, potentially leading to inconsistency between studies. Here, we use a global dataset of 26 306 genomes to compare four clustering methods: gene-by-gene seven-locus MLST, core genome MLST (cgMLST)-based hierarchical clustering (HierCC) assignments, life identification number (LIN) barcoding and k-mer-based PopPUNK clustering (known as GPSCs in this species). We compare the clustering results with phylogenetic and pan-genome analyses to assess their relationship with genome diversity and evolution, as we would expect a good clustering method to form a single monophyletic cluster that has high within-cluster similarity of genomic content. We show that the four methods are generally able to accurately reflect the population structure based on these metrics and that the methods are broadly consistent with each other. We then investigated the discrepancies between cluster assignments in more detail. The greatest concordance was seen between LIN barcoding and HierCC (adjusted mutual information (AMI) score=0.950), which was expected given that both methods use cgMLST, although they differ in how an individual cluster is defined and in the core genome schema used. However, the existence of differences between the two methods shows that the selection of a core genome schema can introduce inconsistencies between studies. GPSC and HierCC assignments were also highly concordant (AMI=0.946), showing that k-mer-based methods which use the whole genome and do not require the careful selection of a core genome schema are just as effective at representing the population structure.
Additionally, where there were differences in clustering between these methods, this could be explained by differences in the accessory genome that were not identified in cgMLST. We conclude that for Streptococcus pneumoniae, standardized and stable nomenclature is important as the number of genomes available expands. Furthermore, the research community should transition away from seven-locus MLST, whilst cgMLST, GPSC and LIN assignments should be used more widely. However, to allow for easy comparison between studies and to make previous literature relevant, the reporting of multiple clustering names should be standardized within the research community.
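The concordance figures quoted in the abstract are adjusted mutual information (AMI) scores between two sets of cluster assignments for the same genomes. As a rough illustration of what such a score measures, the sketch below computes AMI in plain Python using the standard hypergeometric model for the expected mutual information. It is a minimal sketch and not the authors' pipeline: the function names, the isolate labels and the choice of arithmetic-mean normalization are all assumptions made for the example.

```python
from collections import Counter
from math import exp, lgamma, log


def entropy(labels):
    """Shannon entropy (in nats) of one clustering."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two clusterings of the same items."""
    n = len(a)
    size_a, size_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (nij / n) * log(n * nij / (size_a[i] * size_b[j]))
        for (i, j), nij in joint.items()
    )


def expected_mutual_information(a, b):
    """Expected MI when cluster sizes are fixed and membership is random."""
    n = len(a)
    emi = 0.0
    for ai in Counter(a).values():
        for bj in Counter(b).values():
            # Sum over every feasible overlap between the two clusters.
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                # Log of the hypergeometric probability of an overlap of nij.
                log_p = (
                    lgamma(ai + 1) + lgamma(bj + 1)
                    + lgamma(n - ai + 1) + lgamma(n - bj + 1)
                    - lgamma(n + 1) - lgamma(nij + 1)
                    - lgamma(ai - nij + 1) - lgamma(bj - nij + 1)
                    - lgamma(n - ai - bj + nij + 1)
                )
                emi += (nij / n) * log(n * nij / (ai * bj)) * exp(log_p)
    return emi


def adjusted_mutual_information(a, b):
    """AMI with arithmetic-mean normalization (an assumption of this sketch)."""
    if len(a) != len(b):
        raise ValueError("both clusterings must label the same genomes")
    mi = mutual_information(a, b)
    emi = expected_mutual_information(a, b)
    denom = (entropy(a) + entropy(b)) / 2 - emi
    if abs(denom) < 1e-12:
        return 1.0  # degenerate case, e.g. both clusterings have one cluster
    return (mi - emi) / denom


if __name__ == "__main__":
    # Hypothetical assignments for six isolates under two methods.
    gpsc = ["GPSC1", "GPSC1", "GPSC2", "GPSC2", "GPSC3", "GPSC3"]
    hiercc = ["HC_a", "HC_a", "HC_b", "HC_b", "HC_c", "HC_b"]
    print(round(adjusted_mutual_information(gpsc, hiercc), 3))
```

Because AMI depends only on which genomes are grouped together, two methods that produce the same partition under different cluster names score 1, while agreement no better than chance scores close to 0. The pairwise loop over cluster sizes is adequate for a small example; for a dataset of tens of thousands of genomes an established library implementation would be the more practical choice.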