Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, United States.
Department of Quantitative and Computational Biology, Dana and David Dornsife College of Letters, Arts and Sciences, University of Southern California, 3540 S Figueroa St, Los Angeles, CA 90089, United States.
Brief Bioinform. 2024 Jul 25;25(5). doi: 10.1093/bib/bbae462.
Structural variation (SV) refers to insertions, deletions, inversions, and duplications in human genomes. SVs are present in approximately 1.5% of the human genome, yet this small subset of genetic variation has been implicated in the pathogenesis of psoriasis, Crohn's disease and other autoimmune disorders, autism spectrum and other neurodevelopmental disorders, and schizophrenia. Because identifying SVs is an important problem in genetics, and because whole-genome sequencing (WGS) technologies have advanced, a plethora of specialized computational methods have been developed to detect SVs directly from sequencing data. However, dissecting SVs from WGS data remains a challenge: the majority of SV detection methods are prone to a high false-positive rate, and no existing method can precisely detect the full range of SVs present in a sample. Previous studies have shown that none of the existing SV callers maintains high accuracy across SV lengths and genomic coverages. Here, we report an integrated structural variant calling framework, Variant Identification and Structural Variant Analysis (VISTA), that leverages the results of individual callers using a novel and robust filtering and merging algorithm. Unlike existing consensus-based tools, which ignore variant length and coverage, VISTA executes different combinations of top-performing callers depending on variant length and genomic coverage to generate SV events with high accuracy. We evaluated the performance of VISTA on comprehensive gold-standard datasets across organisms and coverage levels: the Genome-in-a-Bottle gold standard SV set, haplotype-resolved de novo assemblies from the Human Pangenome Reference Consortium, and an in-house polymerase chain reaction (PCR)-validated mouse gold standard set.
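The length- and coverage-dependent selection of callers described above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from VISTA: the caller names (`callerA`, `callerB`, `callerC`), the bin cutoffs, the lookup table, and the greedy breakpoint-tolerance merge are all hypothetical stand-ins for the framework's actual filtering and merging algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SVCall:
    chrom: str
    start: int
    end: int
    svtype: str   # e.g. "DEL", "INS", "INV", "DUP"
    caller: str   # name of the tool that produced the call

    @property
    def length(self) -> int:
        return self.end - self.start


# Hypothetical table: which callers to trust for each
# (length bin, coverage bin). Names and choices are placeholders.
CALLER_TABLE = {
    ("short", "low"): {"callerA", "callerB"},
    ("short", "high"): {"callerA"},
    ("long", "low"): {"callerC"},
    ("long", "high"): {"callerB", "callerC"},
}


def length_bin(length: int, cutoff: int = 500) -> str:
    # Assumed cutoff in base pairs, chosen only for the example.
    return "short" if length < cutoff else "long"


def coverage_bin(coverage: float, cutoff: float = 10.0) -> str:
    # Assumed cutoff in fold coverage, chosen only for the example.
    return "low" if coverage < cutoff else "high"


def select_calls(calls, coverage):
    """Keep only calls made by a caller trusted for that length/coverage bin."""
    cov = coverage_bin(coverage)
    return [
        c for c in calls
        if c.caller in CALLER_TABLE[(length_bin(c.length), cov)]
    ]


def merge_calls(calls, tolerance=100):
    """Greedy merge: drop a call if the previously kept call has the same
    chromosome and type and both breakpoints lie within `tolerance` bp."""
    merged = []
    ordered = sorted(calls, key=lambda c: (c.chrom, c.svtype, c.start, c.end))
    for c in ordered:
        last = merged[-1] if merged else None
        if (
            last is not None
            and last.chrom == c.chrom
            and last.svtype == c.svtype
            and abs(last.start - c.start) <= tolerance
            and abs(last.end - c.end) <= tolerance
        ):
            continue
        merged.append(c)
    return merged


raw_calls = [
    SVCall("chr1", 1000, 1200, "DEL", "callerA"),
    SVCall("chr1", 1010, 1205, "DEL", "callerB"),
    SVCall("chr1", 5000, 7000, "DEL", "callerA"),
    SVCall("chr1", 5000, 7000, "DEL", "callerC"),
]
consensus = merge_calls(select_calls(raw_calls, coverage=5.0))
```

In this toy input, the 2 kb deletion reported by `callerA` is discarded because the hypothetical table does not trust that caller for long variants at low coverage, and the two near-identical short deletions collapse into a single event.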
Measured against a comprehensive gold standard, VISTA maintained the highest F1 score among top consensus-based tools on both mouse and human genomes. VISTA also offers an optimized mode in which calls can be tuned for either precision or recall; VISTA-optimized can attain 100% precision and the highest sensitivity relative to other variant callers. In conclusion, VISTA represents a significant advancement in structural variant calling, offering a robust and accurate framework that outperforms existing consensus-based tools and sets a new standard for SV detection in genomic research.
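For readers unfamiliar with the metrics quoted above, the following sketch shows one common way precision, recall (sensitivity), and F1 are computed for a set of SV calls against a gold standard. The matching rule (same chromosome and type, both breakpoints within a fixed tolerance) and the 100 bp tolerance are assumptions chosen for illustration; the abstract does not state the matching criteria used in the benchmark.

```python
def matches(call, truth, tolerance=100):
    """A call matches a gold-standard event if chromosome and SV type agree
    and both breakpoints are within `tolerance` bp (assumed rule)."""
    c_chrom, c_start, c_end, c_type = call
    t_chrom, t_start, t_end, t_type = truth
    return (
        c_chrom == t_chrom
        and c_type == t_type
        and abs(c_start - t_start) <= tolerance
        and abs(c_end - t_end) <= tolerance
    )


def score(calls, gold, tolerance=100):
    """Return (precision, recall, F1); each gold event is matched at most once."""
    matched_gold = set()
    tp = 0
    for call in calls:
        for i, truth in enumerate(gold):
            if i not in matched_gold and matches(call, truth, tolerance):
                matched_gold.add(i)
                tp += 1
                break
    fp = len(calls) - tp   # calls with no gold-standard partner
    fn = len(gold) - tp    # gold-standard events that were missed
    precision = tp / (tp + fp) if calls else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data: tuples of (chrom, start, end, svtype).
gold = [
    ("chr1", 1000, 1200, "DEL"),
    ("chr1", 5000, 7000, "DEL"),
    ("chr2", 300, 900, "INS"),
]
calls = [
    ("chr1", 1010, 1190, "DEL"),   # matches the first gold event
    ("chr1", 9000, 9500, "DEL"),   # false positive
]
precision, recall, f1 = score(calls, gold)
```

A precision-optimized setting corresponds to driving `fp` toward zero, typically at the cost of a larger `fn`, while a recall-optimized setting does the reverse.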