Bayat Arash, Gaëta Bruno, Ignjatovic Aleksandar, Parameswaran Sri
Bioinformatics. 2017 Apr 1;33(7):964-970. doi: 10.1093/bioinformatics/btw748.
The Variant Call Format (VCF) is widely used to store data about genetic variation. Variant calling workflows detect potential variants in large numbers of short sequence reads generated by DNA sequencing and report them in VCF format. To evaluate the accuracy of variant callers, it is critical to correctly compare their output against a reference VCF file containing a gold standard set of variants. However, comparing VCF files is complicated because a single genomic variant can be represented in several different ways, so different software does not necessarily report it identically.
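To illustrate the representation problem, the following minimal Python sketch applies two differently written deletion records to the same toy reference and shows that they yield the same sequence. The function name `apply_variants`, the toy reference and the `(pos, ref, alt)` tuple layout are assumptions made for this illustration and are not taken from the BAN software.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping (pos, ref, alt) records to a reference string.

    pos is 1-based, as in VCF. Records are applied from right to left so
    that earlier coordinates stay valid while the sequence changes length.
    """
    seq = reference
    for pos, ref_allele, alt_allele in sorted(variants, reverse=True):
        start = pos - 1
        end = start + len(ref_allele)
        if reference[start:end] != ref_allele:
            raise ValueError(f"REF allele does not match reference at {pos}")
        seq = seq[:start] + alt_allele + seq[end:]
    return seq


# Toy reference with a CA repeat (hypothetical data).
reference = "GTCACACAG"

# Two ways of writing "delete one CA unit from the repeat".
left_record = [(2, "TCA", "T")]
right_record = [(6, "ACA", "A")]

print(apply_variants(reference, left_record))   # GTCACAG
print(apply_variants(reference, right_record))  # GTCACAG
```

A comparator that matches records by position and alleles alone would treat these two records as different variants, even though they describe the same sample sequence.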
We introduce a VCF normalization method called Best Alignment Normalisation (BAN) that makes VCF file comparison more accurate. BAN applies all the variants in a VCF file to the reference genome to create a sample genome, and then re-calls the variants by aligning this sample genome back to the reference genome. Because the purpose of BAN is to yield accurate results when VCF files are compared, we define a better normalization method as one that results in less disagreement between the outputs of different VCF comparators.
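The two steps described above (build a sample sequence, then re-derive variants by alignment) can be sketched on toy strings as follows. This is only an illustration under stated assumptions: Python's `difflib.SequenceMatcher` stands in for a genome aligner and does not guarantee an optimal alignment, and the names `apply_variants`, `recall_variants` and `normalize` are hypothetical. The abstract does not specify which alignment tool BAN uses.

```python
from difflib import SequenceMatcher


def apply_variants(reference, variants):
    """Apply non-overlapping (pos, ref, alt) records (1-based pos)."""
    seq = reference
    for pos, ref_allele, alt_allele in sorted(variants, reverse=True):
        start = pos - 1
        end = start + len(ref_allele)
        if reference[start:end] != ref_allele:
            raise ValueError(f"REF allele does not match reference at {pos}")
        seq = seq[:start] + alt_allele + seq[end:]
    return seq


def recall_variants(reference, sample):
    """Re-derive (pos, ref, alt) records from an alignment of sample to reference.

    Assumption: difflib is used as a simple stand-in aligner.
    """
    matcher = SequenceMatcher(None, reference, sample, autojunk=False)
    variants = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "replace":
            # Substitution: alleles are the differing stretches themselves.
            variants.append((i1 + 1, reference[i1:i2], sample[j1:j2]))
        elif i1 > 0:
            # Insertion or deletion: anchor on the preceding reference base.
            anchor = reference[i1 - 1]
            variants.append(
                (i1, anchor + reference[i1:i2], anchor + sample[j1:j2])
            )
        else:
            # Simplification of this sketch: indels at position 1 are skipped.
            raise NotImplementedError("indel at sequence start not handled")
    return variants


def normalize(reference, variants):
    """Apply the records, then re-call them from the resulting sequence."""
    sample = apply_variants(reference, variants)
    return recall_variants(reference, sample)


reference = "GTCACACAG"
# Two different records for the same deletion normalize to the same output.
print(normalize(reference, [(2, "TCA", "T")]) == normalize(reference, [(6, "ACA", "A")]))
```

Because the re-called records depend only on the sample sequence, any two inputs that produce the same sequence end up with the same normalized representation, which is the property that makes later comparison easier.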
The BAN Linux bash script, along with the required software, is publicly available at https://sites.google.com/site/banadf16.
Supplementary data are available at Bioinformatics online.