Zhao Xuefang, Weber Alexandra M, Mills Ryan E
Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw Ave, Ann Arbor, MI 48109, USA.
Department of Human Genetics, University of Michigan, 1241 Catherine St, Ann Arbor, MI 48109, USA.
Gigascience. 2017 Aug 1;6(8):1-9. doi: 10.1093/gigascience/gix061.
Although numerous algorithms have been developed to identify structural variations (SVs) in genomic sequences, there is a dearth of approaches for evaluating their results. This matters because accurately identifying structural variation remains an important unsolved problem in genomics. New sequencing technologies that generate longer reads can, in theory, provide direct evidence for all types of SVs regardless of the length of the region they span. However, current efforts to use these data in this way require large computational resources to assemble the sequences, as well as visual inspection of each region. Here we present VaPoR, a highly efficient algorithm that autonomously validates large SV sets using long-read sequencing data. We assessed the performance of VaPoR on SVs in both simulated and real genomes and report high overall accuracy across different sequencing depths. We show that VaPoR can interrogate a much larger range of SVs while still matching existing methods in terms of false positive validations, and that it provides additional features concerning breakpoint precision and predicted genotype. We further show that VaPoR runs quickly and efficiently without requiring a large processing or assembly pipeline. VaPoR provides a long-read-based validation approach for genomic SVs that requires relatively low read depth and computing resources, and it should therefore be useful for accurate SV assessment with targeted or low-pass sequencing coverage. The VaPoR software is available at: https://github.com/mills-lab/vapor.