GVC: efficient random access compression for gene sequence variations.

Affiliations

Institut für Informationsverarbeitung and L3S Research Center, Leibniz University Hannover, Hannover, Germany.

Institut für Nachrichtentechnik, RWTH Aachen University, Aachen, Germany.

Publication information

BMC Bioinformatics. 2023 Mar 28;24(1):121. doi: 10.1186/s12859-023-05240-0.

DOI:10.1186/s12859-023-05240-0
PMID:36978010
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10044409/
Abstract

BACKGROUND

In recent years, advances in high-throughput sequencing technologies have enabled the use of genomic information in many fields, such as precision medicine, oncology, and food quality control. The amount of genomic data being generated is growing rapidly and is expected to soon surpass the amount of video data. The majority of sequencing experiments, such as genome-wide association studies, have the goal of identifying variations in the gene sequence to better understand phenotypic variations. We present a novel approach for compressing gene sequence variations with random access capability: the Genomic Variant Codec (GVC). We use techniques such as binarization, joint row- and column-wise sorting of blocks of variations, as well as the image compression standard JBIG for efficient entropy coding.
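
The abstract names binarization and row- and column-wise sorting but does not spell out the procedure, so the following is only a minimal sketch under stated assumptions. It assumes the variations are available as a small integer allele matrix, uses bit-plane splitting as one possible binarization, and uses plain lexicographic sorting as a stand-in for the sorting step. All function names are hypothetical, none of this is taken from the GVC code base, and the JBIG entropy coding stage is left out.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def binarize_bit_planes(alleles: Matrix, num_bits: int) -> List[Matrix]:
    """Split an integer matrix into num_bits binary matrices (bit planes).

    Plane b holds bit b of every entry, least significant bit first.
    """
    return [
        [[(value >> bit) & 1 for value in row] for row in alleles]
        for bit in range(num_bits)
    ]


def sort_rows(block: Matrix) -> Tuple[Matrix, List[int]]:
    """Sort rows lexicographically (assumed stand-in for the paper's sorting).

    Returns the sorted block and the permutation needed to undo the sort.
    """
    order = sorted(range(len(block)), key=lambda i: block[i])
    return [block[i] for i in order], order


def sort_columns(block: Matrix) -> Tuple[Matrix, List[int]]:
    """Sort columns by transposing, sorting rows, and transposing back."""
    transposed = [list(col) for col in zip(*block)]
    sorted_t, order = sort_rows(transposed)
    return [list(row) for row in zip(*sorted_t)], order


def unsort_rows(sorted_block: Matrix, order: List[int]) -> Matrix:
    """Invert sort_rows so that the original row order is restored."""
    restored: Matrix = [[] for _ in order]
    for position, original_index in enumerate(order):
        restored[original_index] = sorted_block[position]
    return restored


# Toy block: 4 variant sites (rows) x 3 haplotypes (columns), alleles 0..2.
alleles = [[0, 1, 2], [1, 0, 0], [0, 1, 2], [2, 0, 1]]
planes = binarize_bit_planes(alleles, num_bits=2)
sorted_plane, row_order = sort_rows(planes[0])
```

Sorting places identical or similar rows next to each other, which tends to produce the long uniform runs that a bi-level image coder can exploit. The permutation must be stored alongside the block so that the decoder can restore the original order.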

RESULTS

Our results show that GVC provides the best trade-off between compression and random access compared to the state of the art: it reduces the genotype information size from 758 GiB down to 890 MiB on the publicly available 1000 Genomes Project (phase 3) data, which is 21% less than the state of the art in random-access capable methods.
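
As a quick check of the reported figures, the arithmetic below derives the overall compression ratio from the two sizes given above. The implied size of the best competing method is an inference, not a number stated in the abstract: it assumes that "21% less" means the GVC output is 79% of that method's output.

```python
GIB_IN_MIB = 1024  # 1 GiB = 1024 MiB

original_mib = 758 * GIB_IN_MIB  # genotype information before compression
gvc_mib = 890                    # size reported for GVC

# Overall compression ratio, roughly 872:1.
ratio = original_mib / gvc_mib

# Assumption: "21% less" means gvc_mib = (1 - 0.21) * baseline_mib,
# which puts the best competing random-access method near 1127 MiB.
implied_baseline_mib = gvc_mib / (1 - 0.21)

print(f"compression ratio: {ratio:.1f}:1")
print(f"implied baseline size: {implied_baseline_mib:.0f} MiB")
```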

CONCLUSIONS

By providing the best results in terms of combined random access and compression, GVC facilitates the efficient storage of large collections of gene sequence variations. In particular, the random access capability of GVC enables seamless remote data access and application integration. The software is open source and available at https://github.com/sXperfect/gvc/ .

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a9c3/10044409/cae72df4b805/12859_2023_5240_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a9c3/10044409/7c945bbaaf64/12859_2023_5240_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a9c3/10044409/e0bcb84507ab/12859_2023_5240_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a9c3/10044409/5f7e10bcd37d/12859_2023_5240_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a9c3/10044409/cbfba638323e/12859_2023_5240_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a9c3/10044409/7ad671a2bd76/12859_2023_5240_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a9c3/10044409/68b9200aee4a/12859_2023_5240_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a9c3/10044409/d5c38189d855/12859_2023_5240_Fig8_HTML.jpg

Similar articles

1
GVC: efficient random access compression for gene sequence variations.
BMC Bioinformatics. 2023 Mar 28;24(1):121. doi: 10.1186/s12859-023-05240-0.
2
AFRESh: an adaptive framework for compression of reads and assembled sequences with random access functionality.
Bioinformatics. 2017 May 15;33(10):1464-1472. doi: 10.1093/bioinformatics/btx001.
3
AQUa: an adaptive framework for compression of sequencing quality scores with random access functionality.
Bioinformatics. 2018 Feb 1;34(3):425-433. doi: 10.1093/bioinformatics/btx607.
4
WBFQC: A new approach for compressing next-generation sequencing data splitting into homogeneous streams.
J Bioinform Comput Biol. 2018 Oct;16(5):1850018. doi: 10.1142/S021972001850018X. Epub 2018 Jun 28.
5
CALQ: compression of quality values of aligned sequencing data.
Bioinformatics. 2018 May 15;34(10):1650-1658. doi: 10.1093/bioinformatics/btx737.
6
SparkGC: Spark based genome compression for large collections of genomes.
BMC Bioinformatics. 2022 Jul 25;23(1):297. doi: 10.1186/s12859-022-04825-5.
7
SCALCE: boosting sequence compression algorithms using locally consistent encoding.
Bioinformatics. 2012 Dec 1;28(23):3051-7. doi: 10.1093/bioinformatics/bts593. Epub 2012 Oct 9.
8
Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis.
Bioinformatics. 2018 Feb 15;34(4):558-567. doi: 10.1093/bioinformatics/btx639.
9
ERGC: an efficient referential genome compression algorithm.
Bioinformatics. 2015 Nov 1;31(21):3468-75. doi: 10.1093/bioinformatics/btv399. Epub 2015 Jul 2.
10
CHAPAO: Likelihood and hierarchical reference-based representation of biomolecular sequences and applications to compressing multiple sequence alignments.
PLoS One. 2022 Apr 18;17(4):e0265360. doi: 10.1371/journal.pone.0265360. eCollection 2022.

Cited by

1
A benchmark study of compression software for human short-read sequence data.
Sci Rep. 2025 May 2;15(1):15358. doi: 10.1038/s41598-025-00491-8.
2
GSC: efficient lossless compression of VCF files with fast query.
Gigascience. 2024 Jan 2;13. doi: 10.1093/gigascience/giae046.
