EMBL Heidelberg, Genome Biology Unit, Meyerhofstr. 1, 69117 Heidelberg, Germany.
Bioinformatics. 2014 May 15;30(10):1464-6. doi: 10.1093/bioinformatics/btu026. Epub 2014 Jan 21.
As applications of genome sequencing, including exomes and whole genomes, are expanding, there is a need for analysis tools that scale to large sets of samples and/or ultra-deep coverage. Many current tool chains are based on the widely used file formats BAM and VCF or on VCF derivatives. However, for some desirable analyses, data management with these formats creates substantial implementation overhead, and much time is spent parsing files and collating data. We observe that a tally data structure, i.e. the table of counts of nucleotides × samples × strands × genomic positions, provides a reasonable intermediate level of abstraction for many genomics analyses, including single nucleotide variant (SNV) and InDel calling, copy-number estimation and mutation spectrum analysis. Here we present h5vc, a data structure and associated software for managing tallies. The software contains functionality for creating tallies from BAM files, flexible and scalable data visualization, data quality assessment, and computing statistics relevant to variant calling and other applications. Through the simplicity of its API, we envision making low-level analysis of large sets of genome sequencing data accessible to a wider range of researchers.
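To make the tally idea concrete, the following is a minimal sketch in plain Python of a count table indexed by nucleotide × sample × strand × genomic position. It is an illustration only and not the h5vc API or its HDF5 storage layout: the helper names (`make_tally`, `add_read`, `coverage`, `allele_fraction`), the ungapped toy reads and the axis order are all assumptions made for this example.

```python
# Illustrative sketch only: names, axis order and toy reads are assumptions,
# not the h5vc implementation.
BASES = "ACGT"
STRANDS = "+-"


def make_tally(n_samples, n_positions):
    """Return a zero-filled table indexed as tally[base][sample][strand][position]."""
    return [
        [[[0] * n_positions for _ in STRANDS] for _ in range(n_samples)]
        for _ in BASES
    ]


def add_read(tally, sample, strand, start, sequence):
    """Count every base of an ungapped read aligned at `start` (0-based)."""
    strand_idx = STRANDS.index(strand)
    n_positions = len(tally[0][0][0])
    for offset, base in enumerate(sequence):
        pos = start + offset
        # Skip ambiguous bases (e.g. N) and positions outside the window.
        if base in BASES and 0 <= pos < n_positions:
            tally[BASES.index(base)][sample][strand_idx][pos] += 1


def coverage(tally, sample, pos):
    """Total number of counted bases at one position in one sample."""
    return sum(
        tally[b][sample][s][pos]
        for b in range(len(BASES))
        for s in range(len(STRANDS))
    )


def allele_fraction(tally, sample, pos, base):
    """Fraction of counted bases at `pos` equal to `base`, over both strands."""
    depth = coverage(tally, sample, pos)
    if depth == 0:
        return 0.0
    b = BASES.index(base)
    return sum(tally[b][sample][s][pos] for s in range(len(STRANDS))) / depth


# Toy input: (sample index, strand, 0-based start, read sequence).
reads = [
    (0, "+", 0, "ACGT"),
    (0, "-", 1, "CGTA"),
    (1, "+", 0, "ACTT"),
    (1, "-", 1, "CGT"),
]

tally = make_tally(n_samples=2, n_positions=5)
for sample, strand, start, sequence in reads:
    add_read(tally, sample, strand, start, sequence)

# Per-position summaries are simple sums over tally axes.
depth_sample0 = coverage(tally, 0, 2)
t_fraction_sample1 = allele_fraction(tally, 1, 2, "T")
```

Once the counts are in this shape, quantities such as per-sample depth or the per-strand support for a candidate variant reduce to sums over one or more axes, which is the kind of analysis the abstract describes building on tallies rather than on repeated file parsing.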
The package h5vc for the statistical environment R is available through the Bioconductor project. The HDF5 system is used as the core of our implementation.
Supplementary data are available at Bioinformatics online.