Department of Computer Science, ETH Zurich, Zurich, Switzerland.
Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland.
Bioinformatics. 2019 Feb 1;35(3):407-414. doi: 10.1093/bioinformatics/bty632.
Technological advancements in high-throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains hard to query for the research community due to a lack of efficient data representation and indexing solutions. One of the available techniques to represent read data is a condensed form as an assembly graph. Such a representation contains all sequence information but does not store contextual information and metadata.
We present two new approaches for a compressed representation of a graph coloring: a lossless compression scheme based on a novel application of wavelet tries as well as a highly accurate lossy compression based on a set of Bloom filters. Both strategies retain a coloring even when adding to the underlying graph topology. We present construction and merge procedures for both methods and evaluate their performance on a wide range of different datasets. By dropping the requirement of a fully lossless compression and using the topological information of the underlying graph, we can reduce memory requirements by up to three orders of magnitude. Representing individual colors as independently stored modules, our approaches can be efficiently parallelized and provide strategies for dynamic use. These properties allow for an easy upscaling to the problem sizes common to the biomedical domain.
We provide prototype implementations in C++, summaries of our experiments as well as links to all datasets publicly at https://github.com/ratschlab/graph_annotation.
Supplementary data are available at Bioinformatics online.
高通量 DNA 测序技术的进步导致了测序数据的指数级增长,并作为生物医学研究的副产品被存储。尽管这些数据是公开的,但由于缺乏有效的数据表示和索引解决方案,大多数数据仍然难以被研究界查询。表示读取数据的一种可用技术是将其压缩为组装图的形式。这种表示形式包含所有序列信息,但不存储上下文信息和元数据。
我们提出了两种新的图形着色压缩表示方法:一种是基于小波尝试的新颖应用的无损压缩方案,另一种是基于一组布隆过滤器的高度精确的有损压缩方案。这两种策略都能在添加底层图形拓扑结构的情况下保留颜色。我们提出了这两种方法的构建和合并过程,并在广泛的不同数据集上评估了它们的性能。通过放弃完全无损压缩的要求并使用底层图形的拓扑信息,我们可以将内存需求减少三个数量级。通过将单个颜色表示为独立存储的模块,我们的方法可以有效地并行化,并提供动态使用的策略。这些特性允许轻松扩展到生物医学领域常见的问题规模。
我们在 C++中提供了原型实现,实验摘要以及所有数据集的链接都可以在 https://github.com/ratschlab/graph_annotation 上公开获取。
补充数据可在《生物信息学》在线获取。