Suppr超能文献

有向图上的序列比对

Sequence Alignment on Directed Graphs.

作者信息

Kavya Vaddadi Naga Sai, Tayal Kshitij, Srinivasan Rajgopal, Sivadasan Naveen

机构信息

TCS Research, Hyderabad, India.

出版信息

J Comput Biol. 2019 Jan;26(1):53-67. doi: 10.1089/cmb.2017.0264. Epub 2018 Sep 8.

Abstract

Genomic variations in a reference collection are naturally represented as genome variation graphs. Such graphs encode common subsequences as vertices and the variations are captured using additional vertices and directed edges. The resulting graphs are directed graphs possibly with cycles. Existing algorithms for aligning sequences on such graphs make use of partial order alignment (POA) techniques that work on directed acyclic graphs (DAGs). To achieve this, acyclic extensions of the input graphs are first constructed through expensive loop unrolling steps (DAGification). Furthermore, such graph extensions could have considerable blowup in their size and in the worst case the blow-up factor is proportional to the input sequence length. We provide a novel alignment algorithm V-ALIGN that aligns the input sequence directly on the input graph while avoiding such expensive DAGification steps. V-ALIGN is based on a novel dynamic programming (DP) formulation that allows gapped alignment directly on the input graph. It supports affine and linear gaps. We also propose refinements to V-ALIGN for better performance in practice. With the proposed refinements, the time to fill the DP table has linear dependence on the sizes of the sequence, the graph, and its feedback vertex set. We conducted experiments to compare the proposed algorithm against the existing POA-based techniques. We also performed alignment experiments on the genome variation graphs constructed from the 1000 Genomes data. For aligning short sequences, standard approaches restrict the expensive gapped alignment to small filtered subgraphs having high similarity to the input sequence. In such cases, the performance of V-ALIGN for gapped alignment on the filtered subgraph depends on the subgraph sizes.

摘要

参考数据集中的基因组变异自然地表示为基因组变异图。此类图将公共子序列编码为顶点,并使用额外的顶点和有向边来捕获变异。所得的图是可能带有环的有向图。用于在此类图上比对序列的现有算法利用了适用于有向无环图(DAG)的偏序比对(POA)技术。为了实现这一点,首先要通过代价高昂的循环展开步骤(DAG化)来构建输入图的无环扩展。此外,此类图扩展在大小上可能会有相当大的膨胀,在最坏情况下,膨胀因子与输入序列长度成正比。我们提供了一种新颖的比对算法V-ALIGN,它能直接在输入图上比对输入序列,同时避免此类代价高昂的DAG化步骤。V-ALIGN基于一种新颖的动态规划(DP)公式,该公式允许直接在输入图上进行带间隙比对。它支持仿射间隙和线性间隙。我们还针对V-ALIGN提出了改进措施,以在实际应用中获得更好的性能。通过所提出的改进措施,填充DP表的时间与序列、图及其反馈顶点集的大小呈线性相关。我们进行了实验,将所提出的算法与现有的基于POA的技术进行比较。我们还对根据千人基因组数据构建的基因组变异图进行了比对实验。对于比对短序列,标准方法将代价高昂的带间隙比对限制在与输入序列具有高度相似性的小的过滤子图上。在这种情况下,V-ALIGN在过滤子图上进行带间隙比对的性能取决于子图的大小。

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验