School of Biological Sciences, University of Manchester, Oxford Road, Manchester, M13 9PL, UK.
School of Mathematics, University of Leeds, Woodhouse, Leeds, LS2 9JT, UK.
BMC Bioinformatics. 2023 Oct 24;24(1):396. doi: 10.1186/s12859-023-05517-4.
Advances in computational hardware allow researchers to use new approaches to sequence alignment problems. For a given sequence (read), we usually use smaller subsequences (anchors) to find candidate positions within a reference sequence. We may create ("position", "subsequence") pairs for the reference sequence and keep all such records without compression, even on a budget computer. Because a newly sequenced genome differs from the reference genome, the goal is to choose anchors that tolerate these differences while keeping the number of candidate positions that share the same anchor to a minimum. Spaced seeds (masks that ignore symbols at specific locations) are one way to approach the task. An ideal (full sensitivity) spaced seed should enable us to find all such positions for any read that has no more than a given maximum number of mismatches.
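The anchor lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`apply_seed`, `build_index`, `candidate_positions`), the representation of a seed as a string of `1` (compare) and `0` (ignore) characters, and the toy sequences are all assumptions made for the example.

```python
from collections import defaultdict


def apply_seed(seq, start, seed):
    """Return the symbols of seq under the '1' positions of seed, starting at start."""
    return "".join(seq[start + i] for i, bit in enumerate(seed) if bit == "1")


def build_index(reference, seed):
    """Map each masked subsequence of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - len(seed) + 1):
        index[apply_seed(reference, pos, seed)].append(pos)
    return index


def candidate_positions(read, index, seed):
    """Collect candidate start positions of the read within the reference."""
    candidates = set()
    for shift in range(len(read) - len(seed) + 1):
        key = apply_seed(read, shift, seed)
        for pos in index.get(key, []):
            # A hit at reference position pos for the seed placed at offset
            # shift in the read implies the read starts at pos - shift.
            if pos - shift >= 0:
                candidates.add(pos - shift)
    return candidates


# Hypothetical toy data: the read is reference[3:11] with one substitution.
reference = "ACGTACGGTCAGTTACG"
seed = "1101"            # weight 3, length 4; the third symbol is ignored
read = "TATGGTCA"        # reference[3:11] is "TACGGTCA"; position 2 differs

index = build_index(reference, seed)
candidates = candidate_positions(read, index, seed)
```

Here the mismatch falls under the ignored position of the seed when the seed is placed at the start of the read, so the true position 3 is still among the candidates.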
Several algorithms to assist seed generation are presented. The first finds all permitted spaced seeds iteratively. We observe specific patterns in the seeds of the highest weight: these are often periodic seeds with a simple relation between the block size, the seed length and the read length. The second algorithm produces blocks for periodic seeds, for blocks of up to 50 symbols and up to nine mismatches. The third algorithm uses those lists of blocks to find spaced seeds for reads of arbitrary length. Finally, we apply the seeds to a real dataset and compare the results with those obtained for other popular seeds.
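To make the notions of full sensitivity and periodic seeds concrete, the following Python sketch checks a seed by brute force and builds a seed by repeating a block. It is not one of the three algorithms from the paper: `is_fully_sensitive` and `periodic_seed` are hypothetical helpers, the exhaustive search is only practical for short reads, and the block `110` with read length 10 is a toy case chosen for illustration.

```python
from itertools import combinations


def is_fully_sensitive(seed, read_len, max_mismatches):
    """Brute-force check: for every placement of max_mismatches mismatches in a
    read of length read_len, some shift of the seed must have all of its '1'
    positions on matching symbols."""
    ones = [i for i, bit in enumerate(seed) if bit == "1"]
    shifts = range(read_len - len(seed) + 1)
    # Checking exactly max_mismatches positions also covers fewer mismatches,
    # because any smaller set is contained in a larger one.
    for mismatches in combinations(range(read_len), max_mismatches):
        bad = set(mismatches)
        if not any(all(s + i not in bad for i in ones) for s in shifts):
            return False
    return True


def periodic_seed(block, seed_len):
    """Repeat block and truncate to seed_len symbols (illustrative construction)."""
    return (block * (seed_len // len(block) + 1))[:seed_len]


# Toy case (assumed values): block of 3 symbols, seed of 8 symbols, read of 10.
toy_seed = periodic_seed("110", 8)                 # "11011011"
one_mismatch_ok = is_fully_sensitive(toy_seed, 10, 1)
two_mismatch_ok = is_fully_sensitive(toy_seed, 10, 2)
```

In this toy case the seed can be shifted over one full period of the block, so a single mismatch can always be aligned with an ignored position or left outside the seed window; two mismatches (for example at read positions 2 and 3) defeat every shift.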
The PerFSeeB approach helps to significantly reduce the number of possible alignment positions of reads for a known number of mismatches. Lists of long, high-weight spaced seeds are available in Additional file 1. These seeds have the highest weight among the seeds compared from other papers and can usually be applied to shorter reads. Code for all algorithms and the periodic blocks can be found at https://github.com/vtman/PerFSeeB .