UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA 95064, USA.
Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, UK.
Bioinformatics. 2020 Jan 15;36(2):400-407. doi: 10.1093/bioinformatics/btz575.
The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are non-biological, unlikely recombinations of true haplotypes.
We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform. We demonstrate the scalability of the new implementation by building a whole-genome index of the 5008 haplotypes of the 1000 Genomes Project, and an index of all 108,070 chromosome 17 haplotypes in Trans-Omics for Precision Medicine (TOPMed) Freeze 5. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.
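To illustrate the idea of querying haplotype support for a path, the following is a minimal sketch, not the implementation described here. It assumes haplotypes are given as lists of positive integer node identifiers and builds a naive FM-index (suffix sorting plus backward search) over the concatenated paths. The names `build_index` and `count_occurrences` are hypothetical, and the quadratic suffix sort is only suitable for toy inputs; the actual GBWT uses a compressed, node-local representation.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(paths):
    """Build a naive FM-index over haplotype paths (lists of positive node ids)."""
    text = []
    for path in paths:
        text.extend(path)
        text.append(0)  # 0 is reserved as the endmarker between haplotypes
    n = len(text)
    # Naive suffix sorting; fine for a toy example, not for real data.
    suffix_array = sorted(range(n), key=lambda i: text[i:])
    # BWT symbol of a row is the node that precedes the suffix in the text.
    bwt = [text[i - 1] for i in suffix_array]
    # first_row[c] = number of symbols in the text that are smaller than c.
    first_row = {}
    total = 0
    for symbol in sorted(set(text)):
        first_row[symbol] = total
        total += text.count(symbol)
    # rows[c] = sorted BWT rows holding symbol c, used for rank queries.
    rows = defaultdict(list)
    for row, symbol in enumerate(bwt):
        rows[symbol].append(row)
    return {"n": n, "first_row": first_row, "rows": rows}


def count_occurrences(index, subpath):
    """Count how many times the node sequence `subpath` occurs in the haplotypes."""
    lo, hi = 0, index["n"]
    for node in reversed(subpath):  # backward search, last node first
        if node == 0 or node not in index["first_row"]:
            return 0
        positions = index["rows"][node]
        lo = index["first_row"][node] + bisect_left(positions, lo)
        hi = index["first_row"][node] + bisect_left(positions, hi)
        if lo >= hi:
            return 0
    return hi - lo


# Three toy haplotypes through a small graph with two variant sites.
haplotypes = [[1, 2, 4, 5], [1, 3, 4, 5], [1, 2, 4, 6]]
index = build_index(haplotypes)
print(count_occurrences(index, [1, 2, 4]))
print(count_occurrences(index, [3, 4, 6]))
```

In this toy example, the path 3, 4, 6 exists in the graph but is a recombination of two haplotypes, so the query finds no support for it, whereas 1, 2, 4 is found in two haplotypes.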
Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt and https://github.com/jltsiren/gcsa2.
Supplementary data are available at Bioinformatics online.