Didelot Xavier, Ribeca Paolo
School of Life Sciences and Department of Statistics, University of Warwick, Coventry, UK.
NIHR Health Protection Research Unit in Genomics and Enabling Data, University of Warwick, Coventry, UK.
Genome Biol. 2025 Jun 18;26(1):170. doi: 10.1186/s13059-025-03585-8.
Here we introduce KPop, a novel, versatile method based on full k-mer spectra and dataset-specific transformations, which allows thousands of assembled or unassembled microbial genomes to be compared quickly. Unlike MinHash-based methods, which produce distances and have lower resolution, KPop accurately maps sequences onto a low-dimensional space. Extensive validation on simulated and real-life viral and bacterial datasets shows that KPop correctly separates sequences at both species and sub-species levels, even when the overall genomic diversity is low. KPop also rapidly identifies related sequences and systematically outperforms MinHash-based methods.
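To make the general idea concrete, the following is a minimal sketch of what "full k-mer spectrum plus a dataset-specific projection" could look like. It is not KPop's implementation: the abstract does not specify the transformation, so a plain principal-axis projection (found by power iteration) is used here as a stand-in. The function names `kmer_spectrum` and `top_axis`, the choice of k = 2, and the four toy sequences are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
import math
import random


def kmer_spectrum(seq, k=2):
    """Return normalised frequencies of all 4**k DNA k-mers found in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


def top_axis(rows, iters=200, seed=0):
    """Project rows onto their leading principal axis (power iteration).

    Stand-in for a dataset-specific transformation: the axis depends on
    the whole collection of spectra, not on any single sequence.
    """
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - mean[j] for j in range(d)] for r in rows]
    rng = random.Random(seed)
    v = [rng.random() for _ in range(d)]  # arbitrary starting direction
    for _ in range(iters):
        scores = [sum(c[j] * v[j] for j in range(d)) for c in centred]
        w = [sum(scores[i] * centred[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Coordinate of each sequence along the leading axis
    return [sum(c[j] * v[j] for j in range(d)) for c in centred]


# Hypothetical toy "genomes": two AT-rich and two GC-rich sequences
toy = [
    "ATATTAATATTTAATATTAT",
    "TATAATTTATATTAAATATA",
    "GCGCCGGCGCCCGGCGCCGC",
    "CGCGGCCCGCGCCGGGCGCG",
]
spectra = [kmer_spectrum(s) for s in toy]
coords = top_axis(spectra)
print(coords)
```

In this toy setting the two AT-rich sequences land on one side of the axis and the two GC-rich sequences on the other. The sign of the axis itself is arbitrary. Each sequence receives coordinates in a shared space, in contrast with approaches that return only pairwise distances.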