Salikhov Kamil, Sacomoto Gustavo, Kucherov Gregory
Department of Computer Science, Ben-Gurion University of the Negev, Be'er Sheva, Israel.
Algorithms Mol Biol. 2014 Feb 24;9(1):2. doi: 10.1186/1748-7188-9-2.
De Brujin graphs are widely used in bioinformatics for processing next-generation sequencing data. Due to a very large size of NGS datasets, it is essential to represent de Bruijn graphs compactly, and several approaches to this problem have been proposed recently.
In this work, we show how to reduce the memory required by the data structure of Chikhi and Rizk (WABI'12) that represents de Brujin graphs using Bloom filters. Our method requires 30% to 40% less memory with respect to their method, with insignificant impact on construction time. At the same time, our experiments showed a better query time compared to the method of Chikhi and Rizk.
The proposed data structure constitutes, to our knowledge, currently the most efficient practical representation of de Bruijn graphs.
德布鲁因图在生物信息学中被广泛用于处理下一代测序数据。由于下一代测序数据集规模非常大,紧凑地表示德布鲁因图至关重要,最近已经提出了几种解决该问题的方法。
在这项工作中,我们展示了如何减少Chikhi和Rizk(WABI'12)的数据结构所需的内存,该数据结构使用布隆过滤器来表示德布鲁因图。相对于他们的方法,我们的方法所需内存减少30%至40%,对构建时间的影响微不足道。同时,我们的实验表明,与Chikhi和Rizk的方法相比,查询时间更短。
据我们所知,所提出的数据结构构成了目前德布鲁因图最有效的实际表示。