Novak Adam M, Chung Dickson, Hickey Glenn, Djebali Sarah, Yokoyama Toshiyuki T, Garrison Erik, Narzisi Giuseppe, Paten Benedict, Monlong Jean
UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA.
IRSD - Digestive Health Research Institute, University of Toulouse, INSERM, INRAE, ENVT, UPS, Toulouse, France.
bioRxiv. 2024 Oct 15:2024.10.12.618009. doi: 10.1101/2024.10.12.618009.
The current reference genome is the backbone of diverse and rich annotations. Simple text formats, like VCF or BED, have been widely adopted and helped the critical exchange of genomic information. There is a dire need for tools and formats enabling pangenomic annotation to facilitate such enrichment of pangenomic references. The Graph Alignment Format (GAF) is a text format, tab-delimited like BED/VCF files, which was proposed to represent alignments. GAF could also be used to store paths representing annotations in a pangenome graph, but there are no tools to index and query them efficiently. Here, we present extensions to vg and HTSlib that provide efficient sorting, indexing, and querying for GAF files. With this approach, annotations overlapping a subgraph can be extracted quickly. Paths are sorted based on the IDs of traversed nodes, compressed with BGZIP, and indexed with HTSlib/tabix via our extensions for the GAF format. Compared to the binary GAM format, GAF files are easier to edit or inspect because they are plain text, and we show that they are twice as fast to sort and half as large on disk. In addition, we updated vg annotate, which takes BED or GFF3 annotation files relative to linear sequences and projects them into the pangenome. It can now produce GAF files representing these annotations' paths through the pangenome. We showcase these new tools on several applications. We projected annotations for all Human Pangenome Reference Consortium Year 1 haplotypes, including genes, segmental duplications, tandem repeats and repeat annotations, into the Minigraph-Cactus pangenome (GRCh38-based v1.1). We also projected known variants from the GWAS Catalog and expression QTLs from the GTEx project into the pangenome. Finally, we reanalyzed ATAC-seq data from ENCODE to demonstrate what a coverage track could look like in a pangenome graph. These rich annotations can be quickly queried with vg and visualized using existing tools like the Sequence Tube Map or Bandage.
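To make the idea of sorting paths by the IDs of traversed nodes more concrete, here is a minimal Python sketch. It is not the vg or HTSlib implementation. It assumes that the GAF path column (column 6) uses the oriented-node form such as `>5>6<8` with numeric node IDs, and it assumes a sort key of (minimum node ID, maximum node ID), which the abstract does not spell out. The helper names `node_range`, `sort_gaf`, and `overlapping` are hypothetical.

```python
import re

# Matches one oriented step in a GAF path, e.g. ">12" or "<7" (numeric node IDs assumed).
NODE_STEP = re.compile(r"[<>](\d+)")


def node_range(gaf_line):
    """Return (min_node_id, max_node_id) for the path in column 6 of a GAF record."""
    fields = gaf_line.rstrip("\n").split("\t")
    node_ids = [int(n) for n in NODE_STEP.findall(fields[5])]
    if not node_ids:
        raise ValueError("path column has no oriented numeric node IDs: " + fields[5])
    return min(node_ids), max(node_ids)


def sort_gaf(lines):
    """Sort GAF records by (min node ID, max node ID). The key is an assumption."""
    return sorted(lines, key=node_range)


def overlapping(lines, lo, hi):
    """Keep records whose node-ID range intersects the node-ID interval [lo, hi].

    This is a range-based filter, so it can return a path that spans the
    interval without actually visiting any node inside it.
    """
    kept = []
    for line in lines:
        first, last = node_range(line)
        if first <= hi and last >= lo:
            kept.append(line)
    return kept


def make_record(name, path):
    """Build a toy 12-column GAF record; every value except name and path is a placeholder."""
    return "\t".join([name, "100", "0", "100", "+", path,
                      "100", "0", "100", "100", "100", "60"])


records = [
    make_record("geneB", ">30>31>32"),
    make_record("geneA", ">5>6<8"),
    make_record("repeat1", ">7>9"),
]
sorted_records = sort_gaf(records)
hits = overlapping(sorted_records, 8, 10)
```

Once records are ordered by node ID in this way, a block-compressed file can be indexed by node-ID interval in the same spirit as tabix indexes genomic coordinates, so a subgraph query only needs to read the blocks whose intervals intersect the requested nodes.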