Big Data Institute, Nuffield Department of Population Health, University of Oxford, Oxford, UK.
Nuffield Department of Medicine, University of Oxford, Oxford, UK.
Microb Genom. 2022 Apr;8(4). doi: 10.1099/mgen.0.000804.
Comparative analysis of whole-genome sequencing (WGS) data enables fine-scale investigation of transmission and is increasingly part of routine surveillance. However, these analyses are constrained by the computational requirements of the large volumes of data involved. By decomposing WGS reads or assemblies into k-mers and using the dimensionality reduction technique MinHash, it is possible to rapidly approximate genomic distances without alignment. Here we assessed the performance of MinHash, as implemented by sourmash, in predicting single nucleotide differences between genomes (SNPs) and ribotypes (RTs). For a set of 1905 diverse genomes (differing by 0-168 519 SNPs), screening for closely related genomes with sourmash at a sensitivity of 100 % for pairs ≤10 SNPs apart reduced the number of pairs from 1 813 560 overall to 161 934, i.e. by 91 %, with a positive predictive value (PPV) of 32 % for correctly identifying pairs ≤10 SNPs apart (maximum SNP distance 4144). At a sensitivity of 95 %, pairs were reduced by 94 % to 108 266 and PPV increased to 45 % (maximum SNP distance 1009). Increasing the MinHash sketch size above 2000 produced minimal performance improvement. We also explored a MinHash similarity-based ribotype prediction method. Genomes with known ribotypes (n=3937) were randomly split into a training set (n=2937) and a test set (n=1000). The training set was used to construct a sourmash index against which genomes from the test set were compared. If the closest five genomes in the index had the same ribotype, this was taken to predict the searched genome's ribotype. Using our MinHash ribotype index, predicted ribotypes were correct in 780/1000 (78 %) genomes, incorrect in 20 (2 %), and indeterminate in 200 (20 %). Relaxing the classifier to 4/5 closest matches with the same RT improved the correct predictions to 87 %.
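The nearest-neighbour rule described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function name `predict_ribotype`, the `min_agree` parameter and the labels `RT_A`/`RT_B` are hypothetical. It assumes the five closest genomes have already been retrieved from the index and that only their ribotype labels are passed in.

```python
from collections import Counter


def predict_ribotype(neighbour_ribotypes, min_agree=5):
    """Return the consensus ribotype of the closest matches, or None.

    neighbour_ribotypes: ribotype labels of the closest genomes in the
        index (five in the scheme described above).
    min_agree: how many of those labels must be identical; 5 is the
        strict rule, 4 the relaxed "4/5" rule.
    None stands for an indeterminate call.
    """
    if not neighbour_ribotypes:
        return None
    ribotype, count = Counter(neighbour_ribotypes).most_common(1)[0]
    return ribotype if count >= min_agree else None


# Hypothetical labels, for illustration only.
unanimous = ["RT_A"] * 5
four_of_five = ["RT_A"] * 4 + ["RT_B"]

strict_call = predict_ribotype(four_of_five)                # indeterminate
relaxed_call = predict_ribotype(four_of_five, min_agree=4)  # consensus label
```

Lowering `min_agree` from 5 to 4 converts some indeterminate calls into predictions, which corresponds to the relaxation of the classifier reported above.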
Using MinHash it is possible to subsample genome k-mer hashes and use them to approximate small genomic differences within minutes, significantly reducing the search space for further analysis.
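To make the idea of subsampling k-mer hashes concrete, the following is a minimal bottom-k MinHash sketch in plain Python. It is not sourmash's implementation: the hash function (BLAKE2b truncated to 64 bits), the k-mer length of 21 and all function names are assumptions chosen for illustration, and reverse-complement handling is omitted. Only the sketch size of 2000 is taken from the text above.

```python
import hashlib


def kmers(seq, k=21):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (assumed hash, for illustration)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=21, size=2000):
    """Keep only the `size` smallest k-mer hashes (bottom-k sketch)."""
    return sorted({hash_kmer(x) for x in kmers(seq, k)})[:size]


def jaccard_estimate(sketch_a, sketch_b, size=2000):
    """Estimate Jaccard similarity of two k-mer sets from their sketches.

    Takes the `size` smallest hashes of the merged sketches and counts
    the fraction that occur in both sketches.
    """
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)
```

Because each genome is reduced to at most `size` integers, comparing two genomes costs roughly the same regardless of genome length, which is what makes all-against-all screening of large collections feasible.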