Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel.
Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel.
PLoS One. 2021 Oct 14;16(10):e0258693. doi: 10.1371/journal.pone.0258693. eCollection 2021.
Information-theoretic approaches are ubiquitous and effective in a wide variety of bioinformatics applications. In comparative genomics, alignment-free methods based on short DNA words, or k-mers, are particularly powerful. We evaluated the utility of varying k-mer lengths for genome comparisons by analyzing their sequence-space coverage across 5805 genomes in the KEGG GENOME database. In subsequent analyses of four k-mer lengths spanning the relevant range (11, 21, 31, 41), hierarchical clustering of 1634 genus-level representative genomes using pairwise 21-mer and 31-mer Jaccard similarities best recapitulated a phylogenetic/taxonomic tree of life, with clear boundaries between superkingdom domains and high subtree similarity for named taxa at lower levels (family through phylum). By analyzing ~14.2M prokaryotic genome comparisons by their lowest-common-ancestor taxon levels, we detected many potential misclassification errors in a curated database, further demonstrating the need for wide-scale adoption of quantitative taxonomic classifications based on whole-genome similarity.
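As a rough illustration of the two quantities named in the abstract, the sketch below computes a k-mer Jaccard similarity between two sequences and the fraction of the 4^k possible k-mers that a sequence covers. It is a minimal sketch, not the authors' pipeline: the function names (`kmer_set`, `jaccard`, `space_coverage`) are hypothetical, the toy sequences are made up, and strand handling (for example, canonical k-mers) is deliberately left out because the abstract does not specify it.

```python
def kmer_set(seq, k):
    """Return the set of distinct k-mers in seq.

    Assumption: words containing symbols other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    alphabet = set("ACGT")
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= alphabet
    }


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two k-mer sets."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def space_coverage(kmers, k):
    """Fraction of the 4**k possible k-mers that are present in the set."""
    return len(kmers) / 4 ** k


# Toy example with k = 3 (the study itself uses k = 11, 21, 31, 41).
x = kmer_set("ACGTAC", 3)
y = kmer_set("ACGTTT", 3)
similarity = jaccard(x, y)
coverage = space_coverage(x, 3)
```

With real genomes, the same pairwise similarities (or 1 minus the similarity, as a distance) could be passed to a hierarchical clustering routine; exact k-mer sets for thousands of genomes get large, so a sketching method would probably be needed in practice.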