Roberts Miles D, Davis Olivia, Josephs Emily B, Williamson Robert J
ArXiv. 2024 Sep 18:arXiv:2409.11683v1.
Many commonly studied species now have more than one chromosome-scale genome assembly, revealing a large amount of genetic diversity previously missed by approaches that map short reads to a single reference. However, many species still lack multiple reference genomes, and correctly aligning references to build pangenomes is challenging, limiting our ability to study this missing genomic variation in population genetics. Here, we argue that $k$-mers are a crucial stepping stone to bridging the reference-focused paradigms of population genetics with the reference-free paradigms of pangenomics. We review current literature on the uses of $k$-mers for performing three core components of most population genetics analyses: identifying, measuring, and explaining patterns of genetic variation. We also demonstrate how different $k$-mer-based measures of genetic variation behave in population genetic simulations according to the choice of $k$, depth of sequencing coverage, and degree of data compression. Overall, we find that $k$-mer-based measures of genetic diversity scale consistently with pairwise nucleotide diversity ($\pi$) up to values of about $\pi = 0.025$ ($R^2 = 0.97$) for neutrally evolving populations. For populations with even more variation, using shorter $k$-mers maintains this scaling up to at least $\pi = 0.1$. Furthermore, in our simulated populations, $k$-mer dissimilarity values can be reliably approximated from counting Bloom filters, highlighting a potential avenue for decreasing the memory burden of $k$-mer-based genomic dissimilarity analyses. For future studies, there is a great opportunity to further develop methods for identifying selected loci using $k$-mers.
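To make the idea of a $k$-mer dissimilarity and its counting Bloom filter approximation concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which dissimilarity measure or filter design was used, so the choice of Bray-Curtis dissimilarity, the salted BLAKE2b hashing, the slot-by-slot comparison of two filters, and every name (`count_kmers`, `bray_curtis`, `CountingBloomFilter`, `sketch`) are hypothetical. Reverse complements are not merged, and sequences shorter than $k$ are not handled.

```python
import hashlib
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping k-mer in one sequence (strand is ignored)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two count tables (0 = identical)."""
    shared = sum(min(x[key], y[key]) for key in x.keys() & y.keys())
    return 1.0 - 2.0 * shared / (sum(x.values()) + sum(y.values()))


class CountingBloomFilter:
    """Fixed-size counter array; count() may overestimate, never underestimate."""

    def __init__(self, size, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes  # assumed < 256 because of the one-byte salt
        self.counters = [0] * size

    def _slots(self, kmer):
        # One salted BLAKE2b digest per hash function, reduced to a slot index.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, kmer):
        for slot in self._slots(kmer):
            self.counters[slot] += 1

    def count(self, kmer):
        return min(self.counters[slot] for slot in self._slots(kmer))

    def as_counts(self):
        """Expose the non-empty slots as a count table keyed by slot index."""
        return Counter({i: c for i, c in enumerate(self.counters) if c})


def sketch(seq, k, size=1024):
    """Load all k-mers of a sequence into a counting Bloom filter."""
    cbf = CountingBloomFilter(size)
    for i in range(len(seq) - k + 1):
        cbf.add(seq[i:i + k])
    return cbf


# Toy sequences that differ at a single site (hypothetical data).
a, b = "ACGTACGTGGA", "ACGTTCGTGGA"
exact = bray_curtis(count_kmers(a, 4), count_kmers(b, 4))
approx = bray_curtis(sketch(a, 4).as_counts(), sketch(b, 4).as_counts())
print(f"exact={exact:.3f} approx={approx:.3f}")
```

The filter stores a fixed number of counters regardless of how many distinct $k$-mers are seen, which is where the memory saving comes from. The cost is that hash collisions can inflate counts, so the approximate value drifts from the exact one as `size` shrinks relative to the number of distinct $k$-mers.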