Department of Computer Science.
Department of Biomedical Engineering.
Bioinformatics. 2020 Jun 1;36(12):3712-3718. doi: 10.1093/bioinformatics/btaa265.
Read alignment is central to many aspects of modern genomics. Most aligners use heuristics to accelerate processing, but these heuristics can fail to find the optimal alignments of reads. Alignment accuracy is typically measured through simulated reads; however, the simulated location may not be the (only) location with the optimal alignment score.
Vargas implements a heuristic-free algorithm guaranteed to find the highest-scoring alignment for real sequencing reads to a linear or graph genome. With semiglobal and local alignment modes and affine gap and quality-scaled mismatch penalties, it can implement the scoring functions of commonly used aligners to calculate optimal alignments. While this is computationally intensive, Vargas uses multi-core parallelization and vectorized (SIMD) instructions to make it practical to optimally align large numbers of reads, achieving a maximum speed of 456 billion cell updates per second. We demonstrate how these 'gold standard' Vargas alignments can be used to improve heuristic alignment accuracy by optimizing command-line parameters in Bowtie 2, BWA-MEM (maximal exact match) and vg to align more reads correctly.
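To make the terms above concrete, the following is a minimal, unvectorized sketch of heuristic-free alignment scoring with affine gap penalties against a linear reference, in semiglobal and local modes. It is not the Vargas implementation: the function name `optimal_score`, its parameters and the default penalty values are assumptions chosen for illustration, and quality-scaled mismatch penalties, graph genomes, SIMD and multi-threading are all omitted.

```python
NEG_INF = float("-inf")


def optimal_score(read, ref, match=2, mismatch=-6,
                  gap_open=-5, gap_extend=-3, local=False):
    """Return (best_score, cell_updates) for aligning `read` to `ref`.

    Hypothetical helper for illustration only. A gap of length k costs
    gap_open + k * gap_extend (affine). In semiglobal mode the whole read
    must be aligned, but it may start and end anywhere in the reference.
    In local mode any substring of the read may be aligned.
    """
    n, m = len(read), len(ref)
    prev_h = [0] * (m + 1)        # row 0: starting anywhere in ref is free
    prev_f = [NEG_INF] * (m + 1)  # best score ending in a vertical gap
    best = 0 if local else NEG_INF
    cells = 0

    for i in range(1, n + 1):
        cur_h = [0] * (m + 1)
        cur_f = [NEG_INF] * (m + 1)
        if not local:
            # Column 0: the first i read bases are aligned against a gap.
            cur_h[0] = gap_open + i * gap_extend
        e = NEG_INF               # best score ending in a horizontal gap
        for j in range(1, m + 1):
            # Extend an existing gap or open a new one, in each direction.
            e = max(e + gap_extend, cur_h[j - 1] + gap_open + gap_extend)
            cur_f[j] = max(prev_f[j] + gap_extend,
                           prev_h[j] + gap_open + gap_extend)
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            h = max(prev_h[j - 1] + sub, e, cur_f[j])
            if local:
                h = max(h, 0)     # a local alignment may restart anywhere
                best = max(best, h)
            cur_h[j] = h
            cells += 1            # one dynamic-programming cell update
        prev_h, prev_f = cur_h, cur_f

    if not local:
        best = max(prev_h)        # the read may end anywhere in ref
    return best, cells


if __name__ == "__main__":
    print(optimal_score("ACGT", "TTACGTTT"))
    print(optimal_score("ACGT", "TTACCTTT", local=True))
```

Because every cell of the read-by-reference matrix is evaluated, the returned score is the optimum under the chosen scoring function rather than a heuristic estimate. The second return value counts cell updates; dividing it by elapsed time gives the cell-updates-per-second throughput measure quoted above.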
Source code implemented in C++ and compiled binary releases are available at https://github.com/langmead-lab/vargas under the MIT license.
Supplementary data are available at Bioinformatics online.