
Analysis-ready VCF at Biobank scale using Zarr.

Authors

Czech Eric, Millar Timothy R, Tyler Will, White Tom, Elsworth Benjamin, Guez Jérémy, Hancox Jonny, Jeffery Ben, Karczewski Konrad J, Miles Alistair, Tallman Sam, Unneberg Per, Wojdyla Rafal, Zabad Shadi, Hammerbacher Jeff, Kelleher Jerome

Affiliations

Open Athena AI Foundation, Lincoln, New Zealand.

Related Sciences, Lincoln, New Zealand.

Publication information

bioRxiv. 2025 Feb 6:2024.06.11.598241. doi: 10.1101/2024.06.11.598241.

DOI:10.1101/2024.06.11.598241
PMID:38915693
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11195102/
Abstract

BACKGROUND

Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed.
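The access-pattern problem described above can be illustrated with a toy sketch. This is not real VCF parsing and not the authors' code: the three tab-separated records, the field names, and the helper functions `qual_from_rows` and `to_columns` are all hypothetical, chosen only to contrast a row-wise layout with a column-wise one.

```python
# Toy illustration only: a tiny VCF-like table, not real VCF parsing.
ROWS = [
    # CHROM, POS, QUAL, then one genotype call per sample (hypothetical values)
    "chr1\t100\t50.0\t0/0\t0/1\t1/1",
    "chr1\t200\t30.5\t0/1\t0/1\t0/0",
    "chr1\t300\t99.0\t1/1\t0/0\t0/1",
]


def qual_from_rows(rows):
    """Row-wise layout: every full record is split just to read one field."""
    return [float(line.split("\t")[2]) for line in rows]


def to_columns(rows):
    """Column-wise layout: one array per field, built once at conversion time."""
    columns = {"chrom": [], "pos": [], "qual": [], "genotype": []}
    for line in rows:
        chrom, pos, qual, *calls = line.split("\t")
        columns["chrom"].append(chrom)
        columns["pos"].append(int(pos))
        columns["qual"].append(float(qual))
        columns["genotype"].append(calls)
    return columns


COLUMNS = to_columns(ROWS)

# Field access touches only the requested array.
quals = COLUMNS["qual"]
# Sample access is a slice along the sample dimension of the genotype matrix.
sample_1 = [calls[1] for calls in COLUMNS["genotype"]]
```

In the row-wise case the cost of reading one field grows with the full record width (which, for genotype data, grows with the number of samples); in the column-wise case it depends only on the size of the requested field.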

RESULTS

Zarr is a format for storing multi-dimensional data that is widely used across the sciences, and is ideally suited to massively parallel processing. We present the VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for efficient and reliable conversion at scale. We show how this format is far more efficient than standard VCF based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and single-threaded calculation performance. We present case studies on subsets of three large human datasets (Genomics England: n=78,195; Our Future Health: n=651,050; All of Us: n=245,394) along with whole genome datasets for Norway Spruce (n=1,063) and SARS-CoV-2 (n=4,484,157). We demonstrate the potential for VCF Zarr to enable a new generation of high-performance and cost-effective applications via illustrative examples using cloud computing and GPUs.
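One reason a chunked multi-dimensional layout suits parallel processing is that any selection maps to a small, independent set of chunks. The sketch below shows that mapping for a two-dimensional genotype matrix (variants by samples). It is a minimal illustration in plain Python, not the Zarr API and not the VCF Zarr specification; the matrix dimensions, the chunk shape, and the helpers `chunks_for_slice` and `chunks_touched` are assumptions made for the example.

```python
from itertools import product


def chunks_for_slice(start, stop, chunk_size):
    """Return indices of chunks along one dimension that overlap [start, stop)."""
    if stop <= start:
        return range(0)
    return range(start // chunk_size, (stop - 1) // chunk_size + 1)


def chunks_touched(variant_slice, sample_slice, chunk_shape):
    """List (variant_chunk, sample_chunk) grid keys overlapping a 2D selection."""
    v_chunks = chunks_for_slice(*variant_slice, chunk_shape[0])
    s_chunks = chunks_for_slice(*sample_slice, chunk_shape[1])
    return list(product(v_chunks, s_chunks))


# Hypothetical genotype matrix: 10,000 variants x 1,000 samples,
# stored in chunks of 1,000 variants x 100 samples (100 chunks in total).
CHUNK_SHAPE = (1000, 100)

# All variants for a single sample: one column of the chunk grid.
one_sample = chunks_touched((0, 10_000), (42, 43), CHUNK_SHAPE)
# A window of variants for all samples: one row of the chunk grid.
one_window = chunks_touched((2500, 3000), (0, 1000), CHUNK_SHAPE)
```

Under these assumed shapes, both the per-sample and the per-window selection read 10 of the 100 chunks, and each chunk can be fetched and decompressed by a separate worker.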

CONCLUSIONS

Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely used, open-source technologies, has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/45b8/11828549/c6eade5f7f7b/nihpp-2024.06.11.598241v3-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/45b8/11828549/8aba5a518fb9/nihpp-2024.06.11.598241v3-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/45b8/11828549/532b4c919e6b/nihpp-2024.06.11.598241v3-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/45b8/11828549/5001dd8852e6/nihpp-2024.06.11.598241v3-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/45b8/11828549/6764926fc125/nihpp-2024.06.11.598241v3-f0005.jpg
