mlin.net LLC, San Jose, CA 95113, USA.
Regeneron Genetics Center, Regeneron Pharmaceuticals, Inc., Tarrytown, NY 10591, USA.
Bioinformatics. 2021 Apr 1;36(22-23):5537-5538. doi: 10.1093/bioinformatics/btaa1004.
Variant Call Format (VCF), the prevailing representation for germline genotypes in population sequencing, suffers rapid size growth as larger cohorts are sequenced and more rare variants are discovered. We present Sparse Project VCF (spVCF), an evolution of VCF with judicious entropy reduction and run-length encoding, delivering >10× size reduction for modern studies with practically minimal information loss. spVCF interoperates with VCF efficiently, including tabix-based random access. We demonstrate its effectiveness with the DiscovEHR and UK Biobank whole-exome sequencing cohorts.
Apache-licensed reference implementation: github.com/mlin/spVCF.
Supplementary data are available at Bioinformatics online.
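The run-length encoding idea mentioned in the abstract can be illustrated with a small sketch. This is a simplified illustration, not the reference implementation and not the exact spVCF format. It assumes that a genotype cell identical to the cell directly above it (same sample, previous site) can be replaced by a ditto marker, and that consecutive ditto markers in a row can be collapsed into one token carrying a count. The function names `encode_row` and `decode_row`, the `DITTO` marker, and the sample cell values are hypothetical choices made for this example.

```python
from typing import List, Optional

# Marker meaning "same as the cell above"; an illustrative choice.
# Assumes no real genotype cell begins with this character.
DITTO = '"'


def _flush(run: int, out: List[str]) -> None:
    """Append a token for `run` consecutive repeated cells, if any."""
    if run == 1:
        out.append(DITTO)
    elif run > 1:
        out.append(f"{DITTO}{run}")


def encode_row(prev: Optional[List[str]], row: List[str]) -> List[str]:
    """Encode one row of genotype cells relative to the previous row.

    Cells equal to the cell above are counted into a run, and each run
    is emitted as a single token. The first row (prev is None) is
    emitted unchanged.
    """
    out: List[str] = []
    run = 0
    for i, cell in enumerate(row):
        if prev is not None and cell == prev[i]:
            run += 1
            continue
        _flush(run, out)
        run = 0
        out.append(cell)
    _flush(run, out)
    return out


def decode_row(prev: Optional[List[str]], tokens: List[str]) -> List[str]:
    """Invert encode_row, copying repeated cells from the previous row."""
    row: List[str] = []
    for tok in tokens:
        if tok.startswith(DITTO):
            if prev is None:
                raise ValueError("ditto token without a previous row")
            n = int(tok[1:]) if len(tok) > 1 else 1
            start = len(row)
            row.extend(prev[start:start + n])
        else:
            row.append(tok)
    return row


# Two sites, four samples; most cells repeat from one site to the next.
site1 = ["0/0:30", "0/0:30", "0/1:12", "0/0:30"]
site2 = ["0/0:30", "0/0:30", "0/0:25", "0/0:30"]

encoded = encode_row(site1, site2)
decoded = decode_row(site1, encoded)
```

In this toy scheme the saving grows with the number of cells that repeat unchanged between adjacent sites. Reducing the entropy of those cells first, for example by coarsening quality-related fields, would make such repeats more common; that step is not shown here.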