Genomics Institute, Santa Cruz, CA 95064, USA.
Biomolecular Engineering and Bioinformatics, University of California Santa Cruz, Santa Cruz, CA 95064, USA.
Bioinformatics. 2021 Jan 29;36(21):5139-5144. doi: 10.1093/bioinformatics/btaa640.
Pangenomics is a growing field within computational genomics. Many pangenomic analyses use bidirected sequence graphs as their core data model. However, implementing and correctly using this data model can be difficult, and the scale of pangenomic datasets can make them challenging to work with. These challenges have impeded progress in this field.
Here, we present a stack of two C++ libraries, libbdsg and libhandlegraph, which use a simple, field-proven interface, designed to expose elementary features of these graphs while preventing common graph manipulation mistakes. The libraries also provide a Python binding. Using a diverse collection of pangenome graphs, we demonstrate that these tools allow for efficient construction and manipulation of large genome graphs with dense variation. For instance, the speed and memory usage are up to an order of magnitude better than the prior graph implementation in the VG toolkit, which has now transitioned to using libbdsg's implementations.
libhandlegraph and libbdsg are available under an MIT License from https://github.com/vgteam/libhandlegraph and https://github.com/vgteam/libbdsg.
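To make the bidirected sequence graph data model concrete, the following is a minimal, self-contained Python sketch. It is not the libbdsg or libhandlegraph implementation: the class `ToyBidirectedGraph`, the `Handle` dataclass, and the helper `reverse_complement` are hypothetical names written for illustration, and the storage scheme (a dictionary plus a set of edges) is an assumption chosen for brevity rather than efficiency. The idea it shows is that each node carries a sequence, can be visited in either orientation, and that an edge read in one direction implies the same edge read on the opposite strand.

```python
from dataclasses import dataclass

# Complement table for DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass(frozen=True, order=True)
class Handle:
    """A reference to one node in one orientation (illustrative only)."""
    node_id: int
    is_reverse: bool = False

    def flip(self) -> "Handle":
        """Return the same node in the opposite orientation."""
        return Handle(self.node_id, not self.is_reverse)


class ToyBidirectedGraph:
    """Toy bidirected sequence graph; not optimized for large graphs."""

    def __init__(self) -> None:
        self._sequences: dict[int, str] = {}
        self._edges: set[tuple[Handle, Handle]] = set()
        self._next_id = 1

    def create_handle(self, sequence: str) -> Handle:
        """Add a node and return a forward-orientation handle to it."""
        node_id = self._next_id
        self._next_id += 1
        self._sequences[node_id] = sequence
        return Handle(node_id)

    def create_edge(self, left: Handle, right: Handle) -> None:
        """Connect left to right; the reverse-strand reading is stored too."""
        self._edges.add((left, right))
        self._edges.add((right.flip(), left.flip()))

    def get_sequence(self, handle: Handle) -> str:
        """Return the node sequence as read in the handle's orientation."""
        seq = self._sequences[handle.node_id]
        return reverse_complement(seq) if handle.is_reverse else seq

    def follow_edges(self, handle: Handle, go_left: bool = False) -> list[Handle]:
        """List the handles reachable to the right (or left) of a handle."""
        if go_left:
            return sorted(a for a, b in self._edges if b == handle)
        return sorted(b for a, b in self._edges if a == handle)


# Build a two-node graph and walk it on both strands.
g = ToyBidirectedGraph()
a = g.create_handle("GATT")
b = g.create_handle("ACA")
g.create_edge(a, b)

forward_walk = g.get_sequence(a) + g.get_sequence(b)
reverse_walk = g.get_sequence(b.flip()) + g.get_sequence(a.flip())
```

In this sketch the reverse-strand walk spells the reverse complement of the forward walk, which is the property that makes orientation-aware handles less error-prone than tracking node IDs and strands separately.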