Department of Biomolecular Engineering, University of California Santa Cruz Genomics Institute, Santa Cruz, CA 95060, USA.
Bioinformatics. 2020 Jul 1;36(Suppl_1):i146-i153. doi: 10.1093/bioinformatics/btaa446.
Graph representations of genomes are capable of expressing more genetic variation and can therefore better represent a population than standard linear genomes. However, due to the greater complexity of genome graphs relative to linear genomes, some functions that are trivial on linear genomes become much more difficult in genome graphs. Calculating distance is one such function that is simple in a linear genome but complicated in a graph context. In read mapping algorithms such distance calculations are fundamental to determining if seed alignments could belong to the same mapping.
We have developed an algorithm for quickly calculating the minimum distance between positions on a sequence graph using a minimum distance index. We have also developed an algorithm that uses the distance index to cluster seeds on a graph. We demonstrate that our implementations of these algorithms are efficient and practical to use for a new generation of mapping algorithms based upon genome graphs.
Our algorithms have been implemented as part of the vg toolkit and are available at https://github.com/vgteam/vg.
基因组的图形表示能够表达更多的遗传变异,因此可以比标准的线性基因组更好地表示群体。然而,由于基因组图形相对于线性基因组的复杂性增加,一些在线性基因组上很简单的功能在基因组图形中变得更加困难。计算距离就是这样一个功能,它在线性基因组中很简单,但在图形上下文中却很复杂。在读取映射算法中,这些距离计算对于确定种子比对是否可以属于同一映射是非常基础的。
我们开发了一种使用最小距离索引快速计算序列图上位置之间最小距离的算法。我们还开发了一种使用距离索引对图上的种子进行聚类的算法。我们证明,我们对这些算法的实现对于基于基因组图形的新一代映射算法来说是高效和实用的。
我们的算法已经作为 vg 工具包的一部分实现,并可在 https://github.com/vgteam/vg 上获得。