
The scalable variant call representation: enabling genetic analysis beyond one million genomes.

Authors

Poterba Timothy, Vittal Christopher, King Daniel, Goldstein Daniel, Goldstein Jacqueline I, Schultz Patrick, Karczewski Konrad J, Seed Cotton, Neale Benjamin M

Affiliations

Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States.

Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114, United States.

Publication information

Bioinformatics. 2024 Dec 26;41(1). doi: 10.1093/bioinformatics/btae746.

DOI:10.1093/bioinformatics/btae746
PMID:39718771
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11745898/
Abstract

MOTIVATION

The Variant Call Format (VCF) is widely used in genome sequencing but scales poorly. For instance, we estimate that a 150 000-genome VCF would occupy 900 TiB, making it costly and complicated to produce, analyze, and store. The issue stems from VCF's requirement to densely represent both reference genotypes and allele-indexed arrays. These requirements lead to unnecessary data duplication and, ultimately, very large files.
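
As a rough illustration of why allele-indexed arrays scale poorly: for a diploid sample, a VCF genotype-likelihood (PL) array holds one value per unordered pair of alleles at the site, so a site with n alleles carries n(n+1)/2 values for every sample, whether or not that sample has any of the alternate alleles. Because larger cohorts tend to reveal more alternate alleles per site, the per-sample cost grows along with the sample count. The sketch below only tabulates that formula; the function names are hypothetical and the numbers are not taken from the paper.

```python
def pl_length(n_alleles: int) -> int:
    """Number of PL entries for one diploid sample at a site with n_alleles alleles."""
    return n_alleles * (n_alleles + 1) // 2


def dense_pl_values(n_samples: int, n_alleles: int) -> int:
    """Total PL values stored at one site when every sample gets a full array."""
    return n_samples * pl_length(n_alleles)


if __name__ == "__main__":
    # Per-sample array length as the number of alleles at a site grows.
    for n in (2, 3, 10, 50):
        print(n, pl_length(n))
```

A biallelic site needs 3 values per sample, while a site with 50 alleles needs 1275 per sample, most of which describe alleles the sample does not have.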

RESULTS

To address these challenges, we introduce the Scalable Variant Call Representation (SVCR). This representation reduces file sizes by ensuring they scale linearly with samples. SVCR's linear scaling relies on two techniques, both necessary for linearity: local allele indices and reference blocks, which were first introduced by the Genomic Variant Call Format. SVCR is also lossless and mergeable, allowing for N + 1 and N + K incremental joint-calling. We present two implementations of SVCR: SVCR-VCF, which encodes SVCR in VCF format, and VDS, which uses Hail's native format. Our experiments confirm the linear scalability of SVCR-VCF and VDS, in contrast to the super-linear growth seen with standard VCF files. We also discuss the VDS Combiner, a scalable, open-source tool for producing a VDS from GVCFs, and unique features of VDS that enable rapid data analysis. SVCR, and VDS in particular, ensure the scientific community can generate, analyze, and disseminate genetics datasets with millions of samples.
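
The two techniques named above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the SVCR or Hail implementation: the function names are hypothetical, the `LA`/`LGT` naming in the comments follows what we understand to be the VDS convention, and the reference-block logic ignores the genotype-quality banding that real GVCFs apply.

```python
from typing import List, Tuple


def localize_genotype(global_gt: Tuple[int, int]) -> Tuple[List[int], Tuple[int, ...]]:
    """Re-index a genotype against only the alleles this sample carries.

    Returns (LA, LGT): LA lists the global allele indices the sample uses
    (the reference allele 0 is always kept first), and LGT is the genotype
    expressed as indices into LA.
    """
    la = sorted(set(global_gt) | {0})
    lgt = tuple(la.index(a) for a in global_gt)
    return la, lgt


def globalize_genotype(la: List[int], lgt: Tuple[int, ...]) -> Tuple[int, ...]:
    """Invert localize_genotype, showing that the local encoding is lossless."""
    return tuple(la[i] for i in lgt)


def reference_blocks(calls: List[Tuple[int, bool]]) -> List[Tuple[int, int]]:
    """Collapse runs of adjacent homozygous-reference positions into blocks.

    calls is a position-sorted list of (position, is_hom_ref) pairs.
    Each returned block is an inclusive (start, end) interval.
    """
    blocks: List[Tuple[int, int]] = []
    start = prev = None
    for pos, is_ref in calls:
        if is_ref and prev is not None and pos == prev + 1:
            prev = pos  # extend the current block
        else:
            if start is not None:
                blocks.append((start, prev))  # close the previous block
            start, prev = (pos, pos) if is_ref else (None, None)
    if start is not None:
        blocks.append((start, prev))
    return blocks


if __name__ == "__main__":
    # A sample carrying global alleles 3 and 7 at a highly multiallelic site.
    la, lgt = localize_genotype((3, 7))
    print(la, lgt, globalize_genotype(la, lgt))
    print(reference_blocks([(100, True), (101, True), (102, False),
                            (103, True), (104, True), (105, True)]))
```

With local indices, the per-sample record depends on the handful of alleles the sample carries rather than on every allele seen in the cohort, and reference blocks replace one record per homozygous-reference position with one record per run.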

AVAILABILITY AND IMPLEMENTATION

https://github.com/hail-is/hail/.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c9ad/11745898/5ef07d0251ff/btae746f1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c9ad/11745898/11131f910766/btae746f2.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c9ad/11745898/0e8e70bb9735/btae746f3.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c9ad/11745898/23ff0782dabb/btae746f4.jpg
Figure 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c9ad/11745898/c0a815bc9294/btae746f5.jpg
Figure 6: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c9ad/11745898/25bd01ff2399/btae746f6.jpg
