Analysis-ready VCF at Biobank scale using Zarr.

Author information

Czech Eric, Tyler Will, White Tom, Jeffery Ben, Millar Timothy R, Elsworth Benjamin, Guez Jérémy, Hancox Jonny, Karczewski Konrad J, Miles Alistair, Tallman Sam, Unneberg Per, Wojdyla Rafal, Zabad Shadi, Hammerbacher Jeff, Kelleher Jerome

Affiliations

Open Athena AI Foundation, 1245 Broadway, 16th Floor, New York, NY 10001, USA.

Related Sciences, 1312 17th St PMB 76870, Denver, CO 80202, USA.

Publication information

Gigascience. 2025 Jan 6;14. doi: 10.1093/gigascience/giaf049.

Abstract

BACKGROUND

Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasizes efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. The Biobank-scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed.
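The access-pattern problem described above can be made concrete with a toy sketch. The records, field names, and helper functions below are hypothetical and are not real VCF parsing; the sketch only shows that pulling one field out of row-wise text means scanning every record in full, whereas a column-wise layout exposes that field directly.

```python
# Toy illustration only: hypothetical records, not real VCF parsing
# and not the VCF Zarr layout.
rows = [
    # (CHROM, POS, QUAL, genotypes for 4 samples)
    ("chr1", 100, 50.0, [0, 1, 1, 0]),
    ("chr1", 200, 30.0, [0, 0, 1, 1]),
    ("chr1", 300, 40.0, [1, 1, 0, 0]),
]


def row_wise_text(records):
    """Serialise each variant as one tab-separated line (row-wise)."""
    return "\n".join(
        "\t".join([chrom, str(pos), str(qual)] + [str(g) for g in gts])
        for chrom, pos, qual, gts in records
    )


def qual_from_rows(text):
    """Extract QUAL from row-wise text; every line is read in full."""
    bytes_read = 0
    quals = []
    for line in text.split("\n"):
        bytes_read += len(line.encode())
        quals.append(float(line.split("\t")[2]))
    return quals, bytes_read


def to_columns(records):
    """Store each field as its own list (column-wise)."""
    return {
        "chrom": [r[0] for r in records],
        "pos": [r[1] for r in records],
        "qual": [r[2] for r in records],
        "genotypes": [r[3] for r in records],
    }


text = row_wise_text(rows)
quals_row, bytes_read = qual_from_rows(text)

columns = to_columns(rows)
quals_col = columns["qual"]  # one field; no genotype data is touched
```

In the row-wise case the genotype columns dominate `bytes_read` even though only QUAL is wanted, and that overhead grows with the number of samples.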

RESULTS

Zarr is a format for storing multidimensional data that is widely used across the sciences, and is ideally suited to massively parallel processing. We present the VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for efficient and reliable conversion at scale. We show how this format is far more efficient than standard VCF-based approaches, and competitive with specialized methods for storing genotype data in terms of compression ratios and single-threaded calculation performance. We present case studies on subsets of 3 large human datasets (Genomics England: $n$=78,195; Our Future Health: $n$=651,050; All of Us: $n$=245,394) along with whole genome datasets for Norway Spruce ($n$=1,063) and SARS-CoV-2 ($n$=4,484,157). We demonstrate the potential for VCF Zarr to enable a new generation of high-performance and cost-effective applications via illustrative examples using cloud computing and GPUs.
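To illustrate why a chunked multidimensional layout suits both per-variant and per-sample access, here is a minimal sketch of the chunk-grid arithmetic. It does not use the Zarr library or the VCF Zarr specification; the array shape, chunk shape, and function names are assumptions chosen for illustration.

```python
from math import ceil


def chunk_grid(shape, chunks):
    """Number of chunks along each dimension of a chunked array."""
    return tuple(ceil(s / c) for s, c in zip(shape, chunks))


def chunks_for_slice(chunks, start, stop):
    """Chunk indices overlapping the half-open 2D box [start, stop)."""
    rows = range(start[0] // chunks[0], (stop[0] - 1) // chunks[0] + 1)
    cols = range(start[1] // chunks[1], (stop[1] - 1) // chunks[1] + 1)
    return [(i, j) for i in rows for j in cols]


# Hypothetical genotype matrix: variants x samples.
shape = (10_000, 1_000)
# Hypothetical chunk shape: 1,000 variants x 100 samples per chunk.
chunks = (1_000, 100)

grid = chunk_grid(shape, chunks)

# All variants for a single sample (sample index 250).
one_sample = chunks_for_slice(chunks, (0, 250), (10_000, 251))

# All samples for a single variant (variant index 4,321).
one_variant = chunks_for_slice(chunks, (4_321, 0), (4_322, 1_000))
```

With these assumed sizes, either query touches 10 of the 100 chunks. Each chunk can be fetched and decoded independently, so the work can be spread across threads or machines.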

CONCLUSIONS

Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely used, open-source technologies, has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/08a2/12127038/7f99dc84511a/giaf049fig1.jpg
