Medical Population Genetics Program, Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.
Bioinformatics. 2014 Oct 15;30(20):2843-51. doi: 10.1093/bioinformatics/btu356. Epub 2014 Jun 27.
Whole-genome high-coverage sequencing has been widely used for personal and cancer genomics as well as in various research areas. However, in the absence of an unbiased whole-genome truth set, the global error rate of variant calls and the leading causal artifacts remain unclear, despite great efforts in evaluating variant calling methods.
We made 10 single nucleotide polymorphism and INDEL call sets with two read mappers and five variant callers, on both a haploid human genome and a diploid genome at similar coverage. By investigating false heterozygous calls in the haploid genome, we identified erroneous realignment in low-complexity regions and the incompleteness of the reference genome with respect to the sample as the two major sources of errors, which calls for continued improvements in these two areas. We estimated that the error rate of raw genotype calls is as high as 1 in 10-15 kb, but the error rate of post-filtered calls is reduced to 1 in 100-200 kb without significantly compromising sensitivity.
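The reasoning behind the haploid-genome estimate can be illustrated with a minimal sketch: a haploid sample carries one allele per site, so any heterozygous genotype call is an error, and dividing the callable sequence length by the number of such calls gives a "1 error in N kb" figure. The function names, the VCF-style genotype strings and the toy numbers below are assumptions made for illustration only; they are not the authors' scripts or data.

```python
def is_heterozygous(gt: str) -> bool:
    """Return True if a VCF-style GT string has two different called alleles."""
    alleles = gt.replace("|", "/").split("/")
    # Ignore missing calls and anything that is not a two-allele genotype.
    if len(alleles) != 2 or "." in alleles:
        return False
    return alleles[0] != alleles[1]


def kb_per_error(genotypes, callable_bases):
    """Kilobases of callable sequence per false heterozygous call.

    Assumes every genotype comes from a haploid sample, so each
    heterozygous call is counted as an error.
    """
    n_false_het = sum(1 for gt in genotypes if is_heterozygous(gt))
    if n_false_het == 0:
        return float("inf")
    return callable_bases / n_false_het / 1000.0


# Toy input (hypothetical, not taken from the paper).
toy_genotypes = ["1/1", "0/1", "1/1", "1|0", "./.", "1/1"]
toy_callable_bases = 30_000

rate = kb_per_error(toy_genotypes, toy_callable_bases)
print(f"about 1 false heterozygote per {rate:.0f} kb")
```

With real data, the genotype strings would come from a call set on the haploid sample, and the denominator would be the length of the region considered callable; both choices affect the estimate.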
BWA-MEM alignment and raw variant calls are available at http://bit.ly/1g8XqRt; scripts and miscellaneous data are available at https://github.com/lh3/varcmp.
Supplementary data are available at Bioinformatics online.