Suppr超能文献

使用最小去环集保证小窗口的草图方法。

Sketching Methods with Small Window Guarantee Using Minimum Decycling Sets.

机构信息

Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.

出版信息

J Comput Biol. 2024 Jul;31(7):597-615. doi: 10.1089/cmb.2024.0544. Epub 2024 Jul 9.

Abstract

Most sequence sketching methods work by selecting specific -mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Because estimating sequence similarity is much faster using sketches than using sequence alignment, sketching methods are used to reduce the computational requirements of computational biology software. Applications using sketches often rely on properties of the -mer selection procedure to ensure that using a sketch does not degrade the quality of the results compared with using sequence alignment. Two important examples of such properties are locality and window guarantees, the latter of which ensures that no long region of the sequence goes unrepresented in the sketch. A sketching method with a window guarantee, implicitly or explicitly, corresponds to a of the de Bruijn graph, which is a set of unavoidable -mers. Any long enough sequence, by definition, must contain a -mer from any decycling set (hence, the unavoidable property). Conversely, a decycling set also defines a sketching method by choosing the -mers from the set as representatives. Although current methods use one of a small number of sketching method families, the space of decycling sets is much larger and largely unexplored. Finding decycling sets with desirable characteristics (e.g., small remaining path length) is a promising approach to discovering new sketching methods with improved performance (e.g., with small window guarantee). The (MDSs) are of particular interest because of their minimum size. Only two algorithms, by Mykkeltveit and Champarnaud, are previously known to generate two particular MDSs, although there are typically a vast number of alternative MDSs. We provide a simple method to enumerate MDSs. This method allows one to explore the space of MDSs and to find MDSs optimized for desirable properties. We give evidence that the Mykkeltveit sets are close to optimal regarding one particular property, the remaining path length. A number of conjectures and computational and theoretical evidence to support them are presented. Code available at https://github.com/Kingsford-Group/mdsscope.

摘要

大多数序列草图方法通过从序列中选择特定的 -mer 来工作,以便仅使用草图来估计两个序列之间的相似性。由于使用草图估计序列相似性比使用序列比对快得多,因此草图方法用于降低计算生物学软件的计算要求。使用草图的应用程序通常依赖于 -mer 选择过程的属性来确保使用草图不会降低结果的质量,而不是使用序列比对。这种属性的两个重要示例是局部性和窗口保证,后者确保序列的长区域不会在草图中表示。具有窗口保证的草图方法(隐式或显式)对应于 de Bruijn 图的一个分区,de Bruijn 图是一组不可避免的 -mer。任何足够长的序列,根据定义,必须包含来自任何非循环集的 -mer(因此,具有不可避免的属性)。相反,非循环集也通过从集中选择 -mer 作为代表来定义一种草图方法。尽管当前的方法使用少数几种草图方法家族中的一种,但非循环集的空间要大得多,而且大部分尚未开发。找到具有理想特征(例如,剩余路径长度小)的非循环集是发现具有改进性能的新草图方法的一种很有前途的方法(例如,窗口保证小)。最小生成树(MDSs)因其最小尺寸而特别有趣。尽管通常有大量替代 MDSs,但之前只知道 Mykkeltveit 和 Champarnaud 的两种算法可以生成两个特定的 MDSs。我们提供了一种简单的方法来枚举 MDSs。这种方法允许探索 MDSs 的空间,并找到针对理想特性进行优化的 MDSs。我们提供了证据表明,Mykkeltveit 集在特定属性(剩余路径长度)方面接近最优。提出了一些猜想和支持它们的计算和理论证据。代码可在 https://github.com/Kingsford-Group/mdsscope 获得。

相似文献

引用本文的文献

1

本文引用的文献

2
Creating and Using Minimizer Sketches in Computational Genomics.在计算基因组学中创建和使用最小草图。
J Comput Biol. 2023 Dec;30(12):1251-1276. doi: 10.1089/cmb.2023.0094. Epub 2023 Aug 30.
5
Parameterized syncmer schemes improve long-read mapping.参数化同步mers 方案提高了长读测序数据的比对效率。
PLoS Comput Biol. 2022 Oct 28;18(10):e1010638. doi: 10.1371/journal.pcbi.1010638. eCollection 2022 Oct.
9
Sequence-specific minimizers via polar sets.通过极集实现序列特异性最小化。
Bioinformatics. 2021 Jul 12;37(Suppl_1):i187-i195. doi: 10.1093/bioinformatics/btab313.

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验