Efficient indexing and querying of annotations in a pangenome graph.

Authors

Novak Adam M, Chung Dickson, Hickey Glenn, Djebali Sarah, Yokoyama Toshiyuki T, Garrison Erik, Narzisi Giuseppe, Paten Benedict, Monlong Jean

Affiliations

UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA.

IRSD - Digestive Health Research Institute, University of Toulouse, INSERM, INRAE, ENVT, UPS, Toulouse, France.

Publication

bioRxiv. 2024 Oct 15:2024.10.12.618009. doi: 10.1101/2024.10.12.618009.

DOI: 10.1101/2024.10.12.618009
PMID: 39464141
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11507721/

Abstract

The current reference genome is the backbone of diverse and rich annotations. Simple text formats, like VCF or BED, have been widely adopted and helped the critical exchange of genomic information. There is a dire need for tools and formats enabling pangenomic annotation to facilitate such enrichment of pangenomic references. The Graph Alignment Format (GAF) is a text format, tab-delimited like BED/VCF files, which was proposed to represent alignments. GAF could also be used to store paths representing annotations in a pangenome graph, but there are no tools to index and query them efficiently. Here, we present extensions to vg and HTSlib that provide efficient sorting, indexing, and querying for GAF files. With this approach, annotations overlapping a subgraph can be extracted quickly. Paths are sorted based on the IDs of traversed nodes, compressed with BGZIP, and indexed with HTSlib/tabix via our extensions for the GAF format. Compared to the binary GAM format, GAF files are easier to edit or inspect because they are plain text, and we show that they are twice as fast to sort and half as large on disk. In addition, we updated vg annotate, which takes BED or GFF3 annotation files relative to linear sequences and projects them into the pangenome. It can now produce GAF files representing these annotations' paths through the pangenome. We showcase these new tools on several applications. We projected annotations for all Human Pangenome Reference Consortium Year 1 haplotypes, including genes, segmental duplications, tandem repeats and repeats annotations, into the Minigraph-Cactus pangenome (GRCh38-based v1.1). We also projected known variants from the GWAS Catalog and expression QTLs from the GTEx project into the pangenome. Finally, we reanalyzed ATAC-seq data from ENCODE to demonstrate what a coverage track could look like in a pangenome graph. These rich annotations can be quickly queried with vg and visualized using existing tools like the Sequence Tube Map or Bandage.
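
The sorting and querying idea described in the abstract can be illustrated with a small Python sketch. It is not the vg or HTSlib implementation. It assumes that each GAF record is keyed by the smallest and largest node ID on its path (column 6), that records are ordered by the smallest node ID, and that a query is a contiguous range of node IDs. The names `Annotation`, `parse_gaf_line`, `sort_annotations`, and `query`, as well as the example records, are hypothetical.

```python
import re
from typing import Iterable, List, NamedTuple

# A GAF path (column 6) is a series of oriented steps such as ">12>13<15":
# an orientation sign ('>' or '<') followed by a node ID.
_STEP = re.compile(r"([<>])(\d+)")


class Annotation(NamedTuple):
    name: str      # column 1: annotation (query) name
    min_node: int  # smallest node ID on the path (assumed sort key)
    max_node: int  # largest node ID on the path
    line: str      # the unmodified GAF record


def parse_gaf_line(line: str) -> Annotation:
    """Extract the name and the node ID range from one GAF record."""
    fields = line.rstrip("\n").split("\t")
    node_ids = [int(node) for _, node in _STEP.findall(fields[5])]
    if not node_ids:
        raise ValueError("no oriented node steps found in the path column")
    return Annotation(fields[0], min(node_ids), max(node_ids), line)


def sort_annotations(lines: Iterable[str]) -> List[Annotation]:
    """Order records by the node IDs their paths traverse."""
    return sorted((parse_gaf_line(l) for l in lines),
                  key=lambda a: (a.min_node, a.max_node))


def query(sorted_annotations: List[Annotation],
          first_node: int, last_node: int) -> List[Annotation]:
    """Return records whose node ID range overlaps [first_node, last_node]."""
    hits = []
    for ann in sorted_annotations:
        if ann.min_node > last_node:
            break  # sorted by min_node, so no later record can overlap
        if ann.max_node >= first_node:
            hits.append(ann)
    return hits


# Made-up records with the 12 mandatory GAF columns.
records = [
    "geneB\t30\t0\t30\t+\t>7>8>9\t30\t0\t30\t30\t30\t60",
    "repeatA\t20\t0\t20\t+\t>2>3\t20\t0\t20\t20\t20\t60",
    "geneC\t25\t0\t25\t+\t<12<11\t25\t0\t25\t25\t25\t60",
]
index = sort_annotations(records)
names = [a.name for a in query(index, 3, 8)]
print(names)
```

In this sketch the range test is only a coarse filter: a path whose node IDs jump over the queried range would still be reported, so a real tool may need a second pass that checks the actual nodes. Sorting by node ID is what lets the scan stop early, which is the same property that makes a block-compressed, tabix-style index practical.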

Figures

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/df62/11507721/cfdf8f0978c7/nihpp-2024.10.12.618009v1-f0001.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/df62/11507721/2ded2ae8b8b5/nihpp-2024.10.12.618009v1-f0002.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/df62/11507721/5b554cb7047f/nihpp-2024.10.12.618009v1-f0003.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/df62/11507721/1a867b643a4f/nihpp-2024.10.12.618009v1-f0004.jpg
