AGC: compact representation of assembled genomes with fast queries and updates.

Affiliations

Department of Algorithmics and Software, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Akademicka 16, Gliwice 44-100, Poland.

Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02215, USA.

Publication information

Bioinformatics. 2023 Mar 1;39(3). doi: 10.1093/bioinformatics/btad097.

DOI:10.1093/bioinformatics/btad097
PMID:36864624
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9994791/
Abstract

MOTIVATION

High-quality sequence assembly is the ultimate representation of complete genetic information of an individual. Several ongoing pangenome projects are producing collections of high-quality assemblies of various species. Each project has already generated assemblies of hundreds of gigabytes on disk, greatly impeding the distribution of and access to such rich datasets.

RESULTS

Here, we show how to reduce the size of the sequenced genomes by 2-3 orders of magnitude. Our tool compresses the genomes significantly better than the existing programs and is much faster. Moreover, its unique feature is the ability to access any contig (or its part) in a fraction of a second and easily append new samples to the compressed collections. Thanks to this, AGC could be useful not only for backup or transfer purposes but also for routine analysis of pangenome sequences in common pipelines. With the rapidly reduced cost and improved accuracy of sequencing technologies, we anticipate more comprehensive pangenome projects with much larger sample sizes. AGC is likely to become a foundation tool to store, distribute and access pangenome data.
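To make the "2-3 orders of magnitude" figure concrete, the short sketch below works through the arithmetic. The 300 GB starting size is a hypothetical stand-in for "hundreds of gigabytes", not a number reported in the paper, and the helper name `compressed_size_gb` is introduced here only for illustration.

```python
def compressed_size_gb(original_gb: float, orders_of_magnitude: int) -> float:
    """Return the size left after shrinking by the given power of ten."""
    return original_gb / (10 ** orders_of_magnitude)


# Hypothetical collection size (assumption; the abstract says only
# "hundreds of gigabytes").
original_gb = 300.0

for k in (2, 3):
    size = compressed_size_gb(original_gb, k)
    print(f"{k} orders of magnitude: {original_gb:.0f} GB -> {size:.1f} GB")
```

Under this assumption, a collection of that size would shrink to somewhere between a few hundred megabytes and a few gigabytes, small enough to transfer or keep on a workstation.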

AVAILABILITY AND IMPLEMENTATION

The source code of AGC is available at https://github.com/refresh-bio/agc. The package can be installed via Bioconda at https://anaconda.org/bioconda/agc.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/707c/9994791/dd7a97a617de/btad097f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/707c/9994791/a7428680f8c5/btad097f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/707c/9994791/0b80b80265ef/btad097f3.jpg
