Suppr超能文献

具有重叠和间隙成本的共线链接算法。

Algorithms for Colinear Chaining with Overlaps and Gap Costs.

作者信息

Jain Chirag, Gibney Daniel, Thankachan Sharma V

机构信息

Department of Computational and Data Sciences, Indian Institute of Science, Bengaluru, India.

School of Computational Science and Engineering, Georgia Institute of Technology Atlanta, Georgia, USA.

出版信息

J Comput Biol. 2022 Nov;29(11):1237-1251. doi: 10.1089/cmb.2022.0266.

Abstract

Colinear chaining has proven to be a powerful heuristic for finding near-optimal alignments of long DNA sequences (e.g., long reads or a genome assembly) to a reference. It is used as an intermediate step in several alignment tools that employ a seed-chain-extend strategy. Despite this popularity, efficient subquadratic time algorithms for the general case where chains support anchor overlaps and gap costs are not currently known. We present algorithms to solve the colinear chaining problem with anchor overlaps and gap costs in denotes the count of anchors. The degree of the polylogarithmic factor depends on the type of anchors used (e.g., fixed-length anchors) and the type of precedence an optimal anchor chain is required to satisfy. We also establish the first theoretical connection between colinear chaining cost and edit distance. Specifically, we prove that for a fixed set of anchors under a carefully designed chaining cost function, the optimal "anchored" edit distance equals the optimal colinear chaining cost. The anchored edit distance for two sequences and a set of anchors is only a slight generalization of the standard edit distance. It adds an additional cost of one to an alignment of two matching symbols that are not supported by any anchor. Finally, we demonstrate experimentally that optimal colinear chaining cost under the proposed cost function can be computed orders of magnitude faster than edit distance, and achieves correlation coefficient >0.9 with edit distance for closely as well as distantly related sequences.

摘要

共线链接已被证明是一种强大的启发式方法,用于将长DNA序列(例如,长读段或基因组组装)与参考序列进行近似最优比对。它被用作几种采用种子-链接-扩展策略的比对工具中的中间步骤。尽管很受欢迎,但目前尚不知道在链支持锚点重叠且存在间隙成本的一般情况下的高效亚二次时间算法。我们提出了在表示锚点数量的情况下解决带有锚点重叠和间隙成本的共线链接问题的算法。多对数因子的程度取决于所使用的锚点类型(例如,固定长度的锚点)以及要求最优锚点链满足的优先级类型。我们还建立了共线链接成本与编辑距离之间的第一个理论联系。具体来说,我们证明了在精心设计的链接成本函数下,对于一组固定的锚点,最优“锚定”编辑距离等于最优共线链接成本。两个序列和一组锚点的锚定编辑距离只是标准编辑距离的轻微推广。它会在两个没有任何锚点支持的匹配符号的比对上额外增加一个成本。最后,我们通过实验证明,在所提出的成本函数下,最优共线链接成本的计算速度比编辑距离快几个数量级,并且对于密切相关和远距离相关的序列,与编辑距离的相关系数>0.9。

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验