Almutairy Meznah, Torng Eric
Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America.
Department of Computer Science, College of Computer and Information Sciences, Imam Muhammad ibn Saud Islamic University, Riyadh, Saudi Arabia.
PLoS One. 2018 Feb 1;13(2):e0189960. doi: 10.1371/journal.pone.0189960. eCollection 2018.
Bioinformatics applications and pipelines increasingly use k-mer indexes to search for similar sequences. The major problem with k-mer indexes is that they require lots of memory. Sampling is often used to reduce index size and query time. Most applications use one of two major types of sampling: fixed sampling and minimizer sampling. It is well known that fixed sampling will produce a smaller index, typically by roughly a factor of two, whereas it is generally assumed that minimizer sampling will produce faster query times since query k-mers can also be sampled. However, no direct comparison of fixed and minimizer sampling has been performed to verify these assumptions. We systematically compare fixed and minimizer sampling using the human genome as our database. We use the resulting k-mer indexes for fixed sampling and minimizer sampling to find all maximal exact matches between our database, the human genome, and three separate query sets, the mouse genome, the chimp genome, and an NGS data set. We reach the following conclusions. First, using larger k-mers reduces query time for both fixed sampling and minimizer sampling at a cost of requiring more space. If we use the same k-mer size for both methods, fixed sampling requires typically half as much space whereas minimizer sampling processes queries only slightly faster. If we are allowed to use any k-mer size for each method, then we can choose a k-mer size such that fixed sampling both uses less space and processes queries faster than minimizer sampling. The reason is that although minimizer sampling is able to sample query k-mers, the number of shared k-mer occurrences that must be processed is much larger for minimizer sampling than fixed sampling. In conclusion, we argue that for any application where each shared k-mer occurrence must be processed, fixed sampling is the right sampling method.
生物信息学应用程序和流程越来越多地使用k-mer索引来搜索相似序列。k-mer索引的主要问题在于它们需要大量内存。采样通常用于减小索引大小和查询时间。大多数应用程序使用两种主要采样类型之一:固定采样和最小化器采样。众所周知,固定采样会产生较小的索引,通常大约缩小为原来的二分之一,而一般认为最小化器采样会产生更快的查询时间,因为查询k-mer也可以进行采样。然而,尚未对固定采样和最小化器采样进行直接比较以验证这些假设。我们以人类基因组作为数据库,系统地比较固定采样和最小化器采样。我们使用固定采样和最小化器采样得到的k-mer索引,在我们的数据库(人类基因组)与三个单独的查询集(小鼠基因组、黑猩猩基因组和一个NGS数据集)之间查找所有最大精确匹配。我们得出以下结论。首先,使用更大的k-mer会减少固定采样和最小化器采样的查询时间,但代价是需要更多空间。如果我们对两种方法使用相同的k-mer大小,固定采样通常需要的空间只有一半,而最小化器采样处理查询的速度仅略快一点。如果我们可以为每种方法使用任何k-mer大小,那么我们可以选择一个k-mer大小使得固定采样不仅使用更少的空间,而且处理查询的速度比最小化器采样更快。原因是尽管最小化器采样能够对查询k-mer进行采样,但对于最小化器采样而言,必须处理的共享k-mer出现次数比固定采样要多得多。总之,我们认为对于任何必须处理每个共享k-mer出现情况的应用,固定采样是正确的采样方法。