Beller Timo, Ohlebusch Enno
Institute of Theoretical Computer Science, Ulm University, James-Franck-Ring O27/537, 89069 Ulm, Germany.
Algorithms Mol Biol. 2016 Jul 18;11:20. doi: 10.1186/s13015-016-0083-7. eCollection 2016.
Recently, Marcus et al. (Bioinformatics 30:3476-83, 2014) proposed to use a compressed de Bruijn graph to describe the relationship between the genomes of many individuals/strains of the same or closely related species. They devised an [Formula: see text] time algorithm called splitMEM that constructs this graph directly (i.e., without using the uncompressed de Bruijn graph) based on a suffix tree, where n is the total length of the genomes and g is the length of the longest genome. Baier et al. (Bioinformatics 32:497-504, 2016) improved their result.
In this paper, we propose a new space-efficient representation of the compressed de Bruijn graph that adds the possibility to search for a pattern (e.g. an allele-a variant form of a gene) within the pan-genome. The ability to search within the pan-genome graph is of utmost importance and is a design goal of pan-genome data structures.
最近,马库斯等人(《生物信息学》30:3476 - 83,2014年)提议使用压缩德布鲁因图来描述同一或密切相关物种的多个个体/菌株的基因组之间的关系。他们设计了一种名为splitMEM的[公式:见正文]时间算法,该算法基于后缀树直接构建此图(即不使用未压缩的德布鲁因图),其中n是基因组的总长度,g是最长基因组的长度。拜尔等人(《生物信息学》32:497 - 504,2016年)改进了他们的结果。
在本文中,我们提出了一种新的压缩德布鲁因图的空间高效表示方法,该方法增加了在泛基因组内搜索模式(例如等位基因——基因的一种变异形式)的可能性。在泛基因组图内进行搜索的能力至关重要,并且是泛基因组数据结构的一个设计目标。