Efficient short read mapping to a pangenome that is represented by a graph of ED strings.

Affiliation

Institute of Theoretical Computer Science, Ulm University, 89075 Ulm, Germany.

Publication information

Bioinformatics. 2023 May 4;39(5). doi: 10.1093/bioinformatics/btad320.

DOI:10.1093/bioinformatics/btad320

PMID:37171844

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10232250/

Abstract

MOTIVATION

A pangenome represents many diverse genome sequences of the same species. In order to cope with small variations as well as structural variations, recent research has focused on the development of graph-based models of pangenomes. Mapping is the process of finding the original location of a DNA read in a reference sequence, typically a genome. Using a pangenome instead of a (linear) reference genome can, for example, reduce mapping bias, that is, the tendency to incorrectly map sequences that differ from the reference genome. Mapping reads to a graph, however, is more complex and needs more resources than mapping to a reference genome. Reducing the complexity of the graph by encoding simple variations like SNPs in a simple way can accelerate read mapping and reduce the memory requirements at the same time.

RESULTS

We introduce graphs based on elastic-degenerate strings (ED strings, EDS) and the linearized form of these EDS graphs as a new representation for pangenomes. In this representation, small variations are encoded directly in the sequence. Structural variations are encoded in a graph structure. This reduces the size of the representation in comparison to sequence graphs. In the linearized form, mapping techniques that are known from ordinary strings can be applied with appropriate adjustments. Since most variations are expressed directly in the sequence, the mapping process rarely has to take edges of the EDS graph into account. We developed a prototypical software tool GED-MAP that uses this representation together with a minimizer index to map short reads to the pangenome. Our experiments show that the new method works on a whole human genome scale, taking structural variants properly into account. The advantage of GED-MAP, compared with other pangenomic short read mappers, is that the new representation allows for a simple indexing method. This makes GED-MAP fast and memory efficient.
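
To make the two ingredients named above concrete, the following is a minimal Python sketch, not GED-MAP's implementation. It assumes an ED string can be modeled as a list of segments, each holding a list of alternative strings, and it computes (w, k)-minimizers on a plain string only. The function names `spelled_strings` and `minimizers`, the toy sequence, and the lexicographic k-mer ordering are illustrative assumptions; the abstract does not specify how GED-MAP adapts minimizers to the linearized EDS.

```python
from itertools import product

# Assumption: an ED string is modeled as a list of segments, and each
# segment is a list of alternative strings. An empty string stands for
# "nothing here", which lets a segment express an insertion or deletion.

def spelled_strings(eds):
    """Enumerate every plain string that the ED string can spell."""
    return {"".join(choice) for choice in product(*eds)}

def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs chosen as window minimizers.

    Every window of w consecutive k-mers contributes its lexicographically
    smallest k-mer; ties are broken by the leftmost position.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (kmers[i], i))
        picked.add((best, kmers[best]))
    return sorted(picked)

# Toy example: one SNP (T/C) and one optional insertion (TT), both
# encoded directly in the sequence rather than as graph edges.
eds = [["ACG"], ["T", "C"], ["GA"], ["", "TT"], ["C"]]
haplotypes = spelled_strings(eds)

# Minimizers of one spelled string, as a plain-string index would store them.
index_entries = minimizers("ACGTGAC", k=3, w=2)
```

The toy ED string has two variant sites with two alternatives each, so it spells four strings while storing the shared flanks only once. This illustrates why encoding small variations in the sequence keeps the representation compact.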

AVAILABILITY AND IMPLEMENTATION

Sources are available at: https://github.com/thomas-buechler-ulm/gedmap.
