Gézsi András, Bolgár Bence, Marx Péter, Sarkozy Peter, Szalai Csaba, Antal Péter
Department of Genetics, Cell- and Immunobiology, Semmelweis University, Nagyvárad tér 4, Budapest, H-1089, Hungary.
Department of Measurement and Information Systems, Budapest University of Technology and Economics, Magyar tudósok krt. 2, Budapest, H-1117, Hungary.
BMC Genomics. 2015 Oct 28;16:875. doi: 10.1186/s12864-015-2050-y.
The low concordance between different variant calling methods still poses a challenge for the widespread application of next-generation sequencing in research and clinical practice. A wide range of variant annotations can be used to filter call sets and thereby improve the precision of the variant calls, but choosing appropriate filtering thresholds is not straightforward. Variant quality score recalibration provides an alternative to hard filtering, but it requires large-scale genomic data.
We evaluated germline variant calling pipelines based on the BWA and Bowtie 2 aligners in combination with the GATK UnifiedGenotyper, GATK HaplotypeCaller, FreeBayes and SAMtools variant callers, using simulated and real benchmark sequencing data (NA12878 with Illumina Platinum Genomes). We argue that these pipelines are not merely discordant, but that they extract complementary useful information. We introduce VariantMetaCaller to test the hypothesis that the automated fusion of measurement-related information performs better than the recommended hard-filtering settings, recalibration, or the fusion of the individual call sets without annotations. VariantMetaCaller uses Support Vector Machines to combine multiple information sources generated by variant calling pipelines and estimates the probabilities of variants. This novel method had significantly higher sensitivity and precision than the individual variant callers across all target region sizes, ranging from a few hundred kilobases to whole exomes. We also demonstrated that VariantMetaCaller supports a quantitative, precision-based filtering of variants under wider conditions. Specifically, the computed probabilities can be used to order the variants, and for a given threshold, the probabilities can be used to estimate precision. Precision can then be translated directly into the number of true variants called, or equivalently, into the number of false calls, which allows a problem-specific balance to be found between sensitivity and precision.
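The precision-based filtering described above can be illustrated with a short sketch. It is not VariantMetaCaller's implementation; it only shows the arithmetic under the assumption that each variant carries a well-calibrated probability of being a true call. The function names `summarize_at_threshold` and `select_by_precision` and the example probabilities are hypothetical.

```python
from typing import List, Tuple


def summarize_at_threshold(probs: List[float], threshold: float) -> Tuple[int, float, float, float]:
    """Return (calls kept, expected true calls, expected false calls, estimated precision).

    Assumes each value in `probs` is a calibrated probability that the
    corresponding variant call is true (an assumption of this sketch).
    """
    kept = [p for p in probs if p >= threshold]
    n = len(kept)
    if n == 0:
        return 0, 0.0, 0.0, float("nan")
    expected_true = sum(kept)          # sum of probabilities = expected number of true calls
    expected_false = n - expected_true
    return n, expected_true, expected_false, expected_true / n


def select_by_precision(probs: List[float], target_precision: float) -> List[float]:
    """Return the longest top-ranked prefix whose estimated precision meets the target."""
    ranked = sorted(probs, reverse=True)   # order variants by probability
    best_k = 0
    running_sum = 0.0
    for k, p in enumerate(ranked, start=1):
        running_sum += p
        # The mean of a descending list never increases, so the last k
        # that satisfies the target gives the longest admissible prefix.
        if running_sum / k >= target_precision:
            best_k = k
    return ranked[:best_k]


if __name__ == "__main__":
    example_probs = [0.99, 0.95, 0.90, 0.60, 0.20]   # hypothetical values
    print(summarize_at_threshold(example_probs, 0.5))
    print(select_by_precision(example_probs, 0.9))
```

Lowering the target precision admits more calls and so raises sensitivity at the cost of more expected false calls, which is the trade-off the abstract refers to.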
VariantMetaCaller can be applied to small target regions as well as whole exomes, and it can be used for organisms for which highly accurate variant call sets are not yet available. It can therefore be a viable alternative to hard filtering in cases where variant quality score recalibration cannot be used. VariantMetaCaller is freely available at http://bioinformatics.mit.bme.hu/VariantMetaCaller .