tarSVM: Improving the accuracy of variant calls derived from microfluidic PCR-based targeted next generation sequencing using a support vector machine.

Author information

Gillies Christopher E, Otto Edgar A, Vega-Warner Virginia, Robertson Catherine C, Sanna-Cherchi Simone, Gharavi Ali, Crawford Brendan, Bhimma Rajendra, Winkler Cheryl, Kang Hyun Min, Sampson Matthew G

Affiliations

Department of Pediatrics-Nephrology, University of Michigan School of Medicine, Ann Arbor, MI, USA.

Department of Internal Medicine-Nephrology, University of Michigan School of Medicine, Ann Arbor, MI, USA.

Publication information

BMC Bioinformatics. 2016 Jun 10;17(1):233. doi: 10.1186/s12859-016-1108-4.

DOI:10.1186/s12859-016-1108-4

PMID:27287006

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4902911/

Abstract

BACKGROUND

Targeted sequencing of discrete gene sets is a cost-effective strategy to screen subjects for monogenic forms of disease. One method to achieve this pairs microfluidic PCR with next generation sequencing. The PCR step of this pipeline creates challenges for accurate variant calling; in particular, most reads targeting a specific exon are duplicates amplified during the PCR step. To reduce false positive variant calls from these experiments, previous studies have used threshold-based filtering of the alternative allele depth ratio and manual inspection of the alignments. However, even after manual inspection and filtering, many variants fail to be validated via Sanger sequencing. To improve the accuracy of variant calling from these experiments, the challenge is to design a variant filtering strategy that sufficiently models microfluidic PCR-specific issues.
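As a rough illustration of the threshold-based filtering of the alternative allele depth ratio mentioned above, a minimal Python sketch follows. The cutoffs (`min_ratio=0.25`, `min_depth=20`), the function names, and the read counts are hypothetical values chosen for illustration only; they are not taken from the paper or from any prior study.

```python
def alt_allele_ratio(ref_depth, alt_depth):
    """Fraction of reads at a site that support the alternative allele."""
    total = ref_depth + alt_depth
    return alt_depth / total if total > 0 else 0.0


def passes_threshold(ref_depth, alt_depth, min_ratio=0.25, min_depth=20):
    """Keep a call only if it has enough reads and a high enough alt ratio.

    Both cutoffs are illustrative assumptions, not published values.
    """
    total = ref_depth + alt_depth
    return total >= min_depth and alt_allele_ratio(ref_depth, alt_depth) >= min_ratio


# Made-up read counts for three candidate heterozygous calls.
calls = [
    {"id": "var1", "ref_depth": 480, "alt_depth": 520},  # balanced alleles
    {"id": "var2", "ref_depth": 950, "alt_depth": 50},   # skewed toward reference
    {"id": "var3", "ref_depth": 6, "alt_depth": 5},      # too few reads
]

kept = [c["id"] for c in calls if passes_threshold(c["ref_depth"], c["alt_depth"])]
print(kept)  # ['var1']
```

A fixed cutoff of this kind treats every site identically, which is one reason such filters may struggle with PCR-specific artifacts like heavy duplicate amplification.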

RESULTS

We developed an open source variant filtering pipeline, targeted sequencing support vector machine ("tarSVM"), that uses a Support Vector Machine (SVM) and a new score, the normalized allele dosage test, to identify high quality variants from microfluidic PCR data. tarSVM maximizes training knowledge by selecting likely true and likely false variants using information from the 1000 Genomes and the Exome Aggregation Consortium projects. tarSVM improves on previous approaches by synthesizing variant features from the Genome Analysis Toolkit and allele dosage information. We compared the accuracy of tarSVM versus existing variant quality filtering strategies on two cohorts (n = 474 and n = 1152), and validated our method on a third cohort (n = 75). In the first cohort, our method achieved 84.5% accuracy in predicting whether or not a variant would be validated with Sanger sequencing, versus 78.8% for the second most accurate method. In the second cohort, our method had an accuracy of 73.3%, versus 61.5% for the second best method. Finally, our method had a false discovery rate of 5% for the validation cohort.
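The two metrics reported above, accuracy and false discovery rate, can be made concrete with a small Python sketch. It assumes each variant has a filter prediction (pass or fail) and a Sanger validation outcome; the function names and the eight toy variants are invented for illustration and are not data from the study.

```python
def confusion_counts(predicted, validated):
    """Count TP, FP, TN, FN for filter predictions against Sanger outcomes."""
    tp = fp = tn = fn = 0
    for pred, truth in zip(predicted, validated):
        if pred and truth:
            tp += 1      # passed the filter and validated
        elif pred and not truth:
            fp += 1      # passed the filter but failed validation
        elif not pred and not truth:
            tn += 1      # filtered out and failed validation
        else:
            fn += 1      # filtered out but would have validated
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Share of variants whose validation outcome was predicted correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def false_discovery_rate(tp, fp):
    """Share of filter-passing variants that failed validation."""
    return fp / (tp + fp) if (tp + fp) > 0 else 0.0


# Toy data (hypothetical): True = passes filter / validates by Sanger.
predicted = [True, True, True, True, False, False, False, True]
validated = [True, True, True, False, False, False, True, True]

tp, fp, tn, fn = confusion_counts(predicted, validated)
print(accuracy(tp, fp, tn, fn))      # 0.75
print(false_discovery_rate(tp, fp))  # 0.2
```

Accuracy counts both kinds of error, whereas the false discovery rate looks only at variants that pass the filter, which is the quantity that determines how many Sanger validations are wasted.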

CONCLUSIONS

tarSVM increases the accuracy of variant calling when using microfluidic PCR-based targeted sequencing approaches. This results in higher-confidence downstream analyses and ultimately reduces the cost of Sanger validation. Our approach is less labor intensive than existing approaches and is available on GitHub as an open source pipeline for read trimming, aligning, variant calling, and variant quality filtering: https://github.com/christopher-gillies/TargetSpecificGATKSequencingPipeline
