svclassify: a method to establish benchmark structural variant calls.
Author information
Parikh Hemang, Mohiyuddin Marghoob, Lam Hugo Y K, Iyer Hariharan, Chen Desu, Pratt Mark, Bartha Gabor, Spies Noah, Losert Wolfgang, Zook Justin M, Salit Marc
Affiliations
Genome-Scale Measurements Group, Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, MS8313, Gaithersburg, MD, 20899, USA.
Dakota Consulting Inc., 1110 Bonifant Street, Suite 310, Silver Spring, MD, 20910, USA.
Publication information
BMC Genomics. 2016 Jan 16;17:64. doi: 10.1186/s12864-016-2366-2.
BACKGROUND
The human genome contains variants ranging in size from single nucleotide polymorphisms (SNPs) to large structural variants (SVs). High-quality benchmark small variant calls for the pilot National Institute of Standards and Technology (NIST) Reference Material (NA12878) have been developed by the Genome in a Bottle Consortium, but no similar high-quality benchmark SV calls exist for this genome. Since SV callers output highly discordant results, we developed methods to combine multiple forms of evidence from multiple sequencing technologies to classify candidate SVs into likely true or false positives. Our method (svclassify) calculates annotations from one or more aligned BAM files, which may come from many high-throughput sequencing technologies, and then builds a one-class model from these annotations to classify the candidate SVs.
RESULTS
We first used pedigree analysis to develop a set of high-confidence breakpoint-resolved large deletions. We then used svclassify to cluster and classify these deletions, as well as a set of high-confidence deletions from the 1000 Genomes Project and a set of breakpoint-resolved complex insertions from Spiral Genetics. We found that likely SVs cluster separately from likely non-SVs based on our annotations, and that the SVs cluster into different types of deletions. We then developed a supervised one-class classification method that uses a training set of random non-SV regions to determine whether candidate SVs have abnormal annotations that differ from those of most of the genome. To test this classification method, we used our pedigree-based breakpoint-resolved SVs, SVs validated by the 1000 Genomes Project, and assembly-based breakpoint-resolved insertions, along with semi-automated visualization using svviz.
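The one-class idea described above can be illustrated with a minimal sketch. It is not the svclassify implementation: the annotation names (`depth_ratio`, `discordant_frac`), the robust z-score, and the threshold of 3.5 are all assumptions chosen for illustration. The reference distribution is learned only from random regions presumed to be non-SV, and a candidate is flagged when any of its annotations lies far outside that distribution.

```python
from statistics import median

def fit_reference(random_regions):
    """Per-annotation median and MAD, learned only from random (presumed non-SV) regions."""
    ref = {}
    for name in random_regions[0]:
        values = [region[name] for region in random_regions]
        med = median(values)
        mad = median([abs(v - med) for v in values])
        ref[name] = (med, mad or 1e-9)  # guard against a zero MAD
    return ref

def abnormality_score(candidate, ref):
    """Largest robust z-score over all annotations; larger means less like the random regions."""
    return max(
        abs(candidate[name] - med) / (1.4826 * mad)
        for name, (med, mad) in ref.items()
    )

def classify(candidate, ref, threshold=3.5):
    """Hypothetical decision rule: the threshold is an illustrative choice, not from the paper."""
    if abnormality_score(candidate, ref) > threshold:
        return "likely SV"
    return "not distinguishable from non-SV"

# Hypothetical annotations: read depth relative to flanks, fraction of discordant read pairs.
random_regions = [
    {"depth_ratio": 0.95, "discordant_frac": 0.01},
    {"depth_ratio": 1.00, "discordant_frac": 0.02},
    {"depth_ratio": 1.05, "discordant_frac": 0.02},
    {"depth_ratio": 0.98, "discordant_frac": 0.03},
    {"depth_ratio": 1.02, "discordant_frac": 0.01},
]
ref = fit_reference(random_regions)

deletion_like = {"depth_ratio": 0.50, "discordant_frac": 0.30}
normal_like = {"depth_ratio": 1.01, "discordant_frac": 0.02}

print(classify(deletion_like, ref))
print(classify(normal_like, ref))
```

Training on non-SV regions alone means no labeled true SVs are needed, which matters when, as here, a trusted SV benchmark does not yet exist.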
CONCLUSIONS
We find that candidate SVs with high scores from multiple technologies have high concordance with PCR validation and with an orthogonal consensus method, MetaSV (99.7% concordant), whereas candidate SVs with low scores are questionable. We distribute a set of 2676 high-confidence deletions and 68 high-confidence insertions with high svclassify scores from these call sets for benchmarking SV callers. We expect these methods to be particularly useful for establishing high-confidence SV calls for benchmark samples that have been characterized by multiple technologies.
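A concordance figure such as the one above is a fraction of calls that have a match in a comparison set. The abstract does not say how matches were defined, so the sketch below assumes a common convention in SV benchmarking: two calls match when they share at least 50% reciprocal overlap. The function names, the tuple layout, and the coordinates are hypothetical.

```python
def reciprocal_overlap(a, b):
    """Overlap length divided by the longer interval; a and b are (chrom, start, end)."""
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    # Dividing by the longer interval enforces the threshold on both calls at once.
    return overlap / max(a[2] - a[1], b[2] - b[1])

def concordance(calls, comparison, min_overlap=0.5):
    """Fraction of calls with at least one sufficiently overlapping comparison call."""
    if not calls:
        return 0.0
    matched = sum(
        1 for c in calls
        if any(reciprocal_overlap(c, t) >= min_overlap for t in comparison)
    )
    return matched / len(calls)

# Made-up coordinates for illustration only.
calls = [("chr1", 1000, 2000), ("chr1", 5000, 5600), ("chr2", 100, 900)]
comparison = [("chr1", 1050, 2010), ("chr1", 5500, 7000), ("chr2", 100, 900)]

print(f"{concordance(calls, comparison):.1%} concordant")
```

In this toy input the second call overlaps its nearest comparison call by only 100 bp, so it does not count as a match.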