Poterba Timothy, Vittal Christopher, King Daniel, Goldstein Daniel, Goldstein Jacqueline I, Schultz Patrick, Karczewski Konrad J, Seed Cotton, Neale Benjamin M
Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States.
Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114, United States.
Bioinformatics. 2024 Dec 26;41(1). doi: 10.1093/bioinformatics/btae746.
The Variant Call Format (VCF) is widely used in genome sequencing but scales poorly. For instance, we estimate that a VCF of 150 000 genomes would occupy 900 TiB, making it costly and complicated to produce, analyze, and store. The issue stems from VCF's requirement to densely represent both reference genotypes and allele-indexed arrays. These requirements lead to unnecessary data duplication and, ultimately, very large files.
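A rough sketch of why dense allele-indexed arrays are costly, assuming diploid samples and a genotype-likelihood array (such as PL) with one entry per possible genotype. The function names are hypothetical and the count ignores every other field, so it illustrates the growth pattern rather than reproducing the 900 TiB estimate.

```python
def genotype_count(n_alleles: int) -> int:
    """Number of unordered diploid genotypes at a site with n_alleles
    alleles (reference included): n * (n + 1) / 2."""
    return n_alleles * (n_alleles + 1) // 2


def dense_entries(n_samples: int, n_alleles: int) -> int:
    """Entries stored at one site when every sample carries a full
    per-genotype array, whether or not it has a non-reference allele."""
    return n_samples * genotype_count(n_alleles)


# A biallelic site needs 3 entries per sample; a 10-allele site needs 55.
per_sample_biallelic = genotype_count(2)
per_sample_ten_alleles = genotype_count(10)
```

As more samples are added, more alleles tend to be observed at a site, so both factors in `dense_entries` grow together, which is consistent with the super-linear growth described for standard VCF.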
To address these challenges, we introduce the Scalable Variant Call Representation (SVCR). This representation reduces file sizes by ensuring they scale linearly with the number of samples. SVCR's linear scaling relies on two techniques, both necessary for linearity: local allele indices and reference blocks, the latter first introduced by the Genomic Variant Call Format (GVCF). SVCR is also lossless and mergeable, allowing for N + 1 and N + K incremental joint-calling. We present two implementations of SVCR: SVCR-VCF, which encodes SVCR in VCF format, and VDS, which uses Hail's native format. Our experiments confirm the linear scalability of SVCR-VCF and VDS, in contrast to the super-linear growth seen with standard VCF files. We also discuss the VDS Combiner, a scalable, open-source tool for producing a VDS from GVCFs, and unique features of VDS that enable rapid data analysis. SVCR, and VDS in particular, ensure the scientific community can generate, analyze, and disseminate genetics datasets with millions of samples.
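The two techniques can be illustrated with a minimal sketch. It is not the SVCR-VCF or VDS implementation: the `LocalCall` type, its field names, and all three functions are hypothetical, and the merging rule for reference blocks (adjacent positions with an identical genotype quality) is a simplifying assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class LocalCall:
    la: list[int]         # local alleles: indices into the site's global allele list
    lgt: tuple[int, ...]  # genotype expressed as indices into `la`


def localize(gt: tuple[int, int]) -> LocalCall:
    """Re-express a genotype over only the alleles this sample carries
    (plus the reference, index 0), so per-sample arrays stay small."""
    la = sorted({0, *gt})
    return LocalCall(la=la, lgt=tuple(la.index(a) for a in gt))


def globalize(call: LocalCall) -> tuple[int, ...]:
    """Recover the global genotype, showing the mapping is lossless."""
    return tuple(call.la[i] for i in call.lgt)


def reference_blocks(hom_ref: list[tuple[int, int]]) -> list[tuple[int, int, int]]:
    """Collapse sorted (position, gq) homozygous-reference calls into
    (start, end, gq) blocks when positions are adjacent and gq matches."""
    blocks: list[tuple[int, int, int]] = []
    for pos, gq in hom_ref:
        if blocks and blocks[-1][1] == pos - 1 and blocks[-1][2] == gq:
            start = blocks[-1][0]
            blocks[-1] = (start, pos, gq)  # extend the current block
        else:
            blocks.append((pos, pos, gq))  # start a new block
    return blocks


# A sample heterozygous for global alleles 3 and 7 stores only three local alleles.
call = localize((3, 7))
blocks = reference_blocks([(100, 30), (101, 30), (102, 30), (103, 20), (105, 20)])
```

Here `call.la` is `[0, 3, 7]` and `call.lgt` is `(1, 2)`, regardless of how many alleles other samples contribute at the site, and the five reference calls collapse into three blocks.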