Department of Biotechnology, University of the Western Cape, Bellville, South Africa.
Methods Mol Biol. 2023;2672:79-113. doi: 10.1007/978-1-0716-3226-0_4.
Recent advances in sequencing technologies have made it possible to sequence non-model organisms with very large and complex genomes. The resulting data can be used to estimate diverse genome characteristics, including genome size, repeat content, and levels of heterozygosity. K-mer analysis is a powerful biocomputational approach with a wide range of applications, including the estimation of genome size. However, interpretation of the results is not always straightforward. Here, I review k-mer-based genome size estimation, focusing specifically on k-mer theory and peak calling in k-mer frequency histograms. I highlight common pitfalls in data analysis and result interpretation, and provide a comprehensive overview of current methods and programs developed to conduct these analyses.
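To make the idea of peak calling in a k-mer frequency histogram concrete, the following is a minimal sketch and is not taken from the chapter itself. It relies on the widely used relation that genome size is approximately the total number of k-mers divided by the depth of the main (homozygous) peak. All function names and the `min_depth` cutoff are hypothetical. The sketch also makes several simplifying assumptions: it ignores reverse-complement (canonical) k-mers, it treats low-depth k-mers as sequencing errors, and it takes the tallest peak as the homozygous peak, which may not hold for highly heterozygous genomes.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer in a list of read strings (no canonicalization)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_histogram(counts):
    """Map depth (k-mer multiplicity) to the number of distinct k-mers."""
    return Counter(counts.values())


def call_main_peak(hist, min_depth=2):
    """Naive peak calling: the depth >= min_depth with the most distinct k-mers.

    min_depth is an assumed cutoff that skips the low-depth error k-mers.
    Ties are resolved in favor of the lower depth.
    """
    candidates = {d: n for d, n in hist.items() if d >= min_depth}
    if not candidates:
        raise ValueError("no k-mers at or above min_depth")
    return max(candidates, key=lambda d: (candidates[d], -d))


def estimate_genome_size(hist, min_depth=2):
    """Genome size ~ total k-mers (errors excluded) / depth of the main peak."""
    peak = call_main_peak(hist, min_depth)
    total_kmers = sum(d * n for d, n in hist.items() if d >= min_depth)
    return total_kmers / peak


# Toy histogram: 500 error k-mers at depth 1, main peak centered at depth 10.
toy_hist = {1: 500, 9: 10, 10: 80, 11: 10}
print(call_main_peak(toy_hist))        # depth of the called peak
print(estimate_genome_size(toy_hist))  # estimated genome size in bases
```

The estimate depends directly on which peak is called. If a heterozygous peak at roughly half the homozygous depth were chosen instead, the estimate would approximately double, which is one reason peak calling deserves care.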