Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.
J Comput Biol. 2024 Jul;31(7):597-615. doi: 10.1089/cmb.2024.0544. Epub 2024 Jul 9.
Most sequence sketching methods work by selecting specific -mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Because estimating sequence similarity is much faster using sketches than using sequence alignment, sketching methods are used to reduce the computational requirements of computational biology software. Applications using sketches often rely on properties of the -mer selection procedure to ensure that using a sketch does not degrade the quality of the results compared with using sequence alignment. Two important examples of such properties are locality and window guarantees, the latter of which ensures that no long region of the sequence goes unrepresented in the sketch. A sketching method with a window guarantee, implicitly or explicitly, corresponds to a of the de Bruijn graph, which is a set of unavoidable -mers. Any long enough sequence, by definition, must contain a -mer from any decycling set (hence, the unavoidable property). Conversely, a decycling set also defines a sketching method by choosing the -mers from the set as representatives. Although current methods use one of a small number of sketching method families, the space of decycling sets is much larger and largely unexplored. Finding decycling sets with desirable characteristics (e.g., small remaining path length) is a promising approach to discovering new sketching methods with improved performance (e.g., with small window guarantee). The (MDSs) are of particular interest because of their minimum size. Only two algorithms, by Mykkeltveit and Champarnaud, are previously known to generate two particular MDSs, although there are typically a vast number of alternative MDSs. We provide a simple method to enumerate MDSs. This method allows one to explore the space of MDSs and to find MDSs optimized for desirable properties. We give evidence that the Mykkeltveit sets are close to optimal regarding one particular property, the remaining path length. A number of conjectures and computational and theoretical evidence to support them are presented. Code available at https://github.com/Kingsford-Group/mdsscope.
大多数序列草图方法通过从序列中选择特定的 -mer 来工作,以便仅使用草图来估计两个序列之间的相似性。由于使用草图估计序列相似性比使用序列比对快得多,因此草图方法用于降低计算生物学软件的计算要求。使用草图的应用程序通常依赖于 -mer 选择过程的属性来确保使用草图不会降低结果的质量,而不是使用序列比对。这种属性的两个重要示例是局部性和窗口保证,后者确保序列的长区域不会在草图中表示。具有窗口保证的草图方法(隐式或显式)对应于 de Bruijn 图的一个分区,de Bruijn 图是一组不可避免的 -mer。任何足够长的序列,根据定义,必须包含来自任何非循环集的 -mer(因此,具有不可避免的属性)。相反,非循环集也通过从集中选择 -mer 作为代表来定义一种草图方法。尽管当前的方法使用少数几种草图方法家族中的一种,但非循环集的空间要大得多,而且大部分尚未开发。找到具有理想特征(例如,剩余路径长度小)的非循环集是发现具有改进性能的新草图方法的一种很有前途的方法(例如,窗口保证小)。最小生成树(MDSs)因其最小尺寸而特别有趣。尽管通常有大量替代 MDSs,但之前只知道 Mykkeltveit 和 Champarnaud 的两种算法可以生成两个特定的 MDSs。我们提供了一种简单的方法来枚举 MDSs。这种方法允许探索 MDSs 的空间,并找到针对理想特性进行优化的 MDSs。我们提供了证据表明,Mykkeltveit 集在特定属性(剩余路径长度)方面接近最优。提出了一些猜想和支持它们的计算和理论证据。代码可在 https://github.com/Kingsford-Group/mdsscope 获得。