Department of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.
Department of Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.
J Comput Biol. 2024 Jan;31(1):2-20. doi: 10.1089/cmb.2023.0212. Epub 2023 Nov 17.
Minimizers and syncmers are sketching methods that sample representative k-mer seeds from a long string. The minimizer scheme guarantees a well-spread k-mer sketch (high coverage) while seeking to minimize the sketch size (low density). The syncmer scheme yields sketches that are more robust to base substitutions (high conservation) on random sequences, but it does not have the coverage guarantee of minimizers. These sketching metrics are generally adversarial to one another, especially in the context of sketch optimization for a specific sequence, and are thus difficult to achieve simultaneously. The parameterized syncmer scheme was recently introduced as a generalization of syncmers with more flexible sampling rules and empirically better coverage than the original syncmer variants. However, no approach exists to optimize parameterized syncmers. To address this shortcoming, we introduce a new scheme called masked minimizers, which generalizes minimizers in a manner analogous to how parameterized syncmers generalize syncmers and allows us to extend existing optimization techniques developed for minimizers. This results in a practical algorithm to optimize the masked minimizer scheme with respect to both density and conservation. We evaluate the optimization algorithm on various benchmark genomes and show that it finds sketches that are overall more compact, well-spread, and robust to substitutions than those found by previous methods. Our implementation is released at https://github.com/Kingsford-Group/maskedminimizer. This new technique will enable more efficient and robust genomic analyses in the many settings where minimizers and syncmers are used.
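To make the terms above concrete, the following is a minimal Python sketch of the two baseline schemes the abstract contrasts, not the released masked minimizer implementation. It assumes a plain lexicographic k-mer ordering and leftmost tie-breaking. The function names (`minimizer_positions`, `syncmer_positions`, `density`, `max_gap`) and the `offsets` parameter are hypothetical and chosen for illustration only. The syncmer function selects a k-mer when its smallest s-mer sits at one of the allowed offsets, which is the usual way open, closed, and parameterized syncmers are described.

```python
def minimizer_positions(seq, k, w):
    """Start positions of k-mers picked by a (w, k)-minimizer.

    Assumption: lexicographic k-mer order, leftmost k-mer wins ties.
    Every window of w consecutive k-mers contributes its smallest k-mer.
    """
    n_kmers = len(seq) - k + 1
    selected = set()
    for start in range(n_kmers - w + 1):
        best = min(range(start, start + w), key=lambda i: (seq[i:i + k], i))
        selected.add(best)
    return sorted(selected)


def syncmer_positions(seq, k, s, offsets):
    """Start positions of k-mers whose smallest s-mer lies at an allowed offset.

    Assumption: offsets={0} gives an open syncmer, offsets={0, k - s} a closed
    syncmer, and a larger set a parameterized syncmer.
    """
    selected = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        best = min(range(k - s + 1), key=lambda j: (kmer[j:j + s], j))
        if best in offsets:
            selected.append(i)
    return selected


def density(positions, seq, k):
    """Fraction of all k-mers in seq that were selected (lower is more compact)."""
    return len(positions) / (len(seq) - k + 1)


def max_gap(positions):
    """Largest distance between consecutive selected positions (coverage proxy)."""
    if len(positions) < 2:
        return 0
    return max(b - a for a, b in zip(positions, positions[1:]))
```

On the toy string `ACGTACGT` with k = 3 and w = 3, `minimizer_positions` selects positions 0, 1, and 4, giving a density of 0.5 and a largest gap of 3, which does not exceed w. The syncmer rule decides each k-mer in isolation, so it carries no such bound on the gap.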