Salavert José, Tomás Andrés, Tárraga Joaquín, Medina Ignacio, Dopazo Joaquín, Blanquer Ignacio
GRyCAP department of I3M, Universitat Politècnica de València, Building 8B, Camino de vera s/n, Valencia, 46022, Spain.
Bioinformatics department of Centro de Investigación Príncipe Felipe, Autopista del Saler 16, Valencia, 46012, Spain.
BMC Bioinformatics. 2015 Jan 28;16:18. doi: 10.1186/s12859-014-0438-3.
Short sequence mapping methods for Next Generation Sequencing consist on a combination of seeding techniques followed by local alignment based on dynamic programming approaches. Most seeding algorithms are based on backward search alignment, using the Burrows Wheeler Transform, the Ferragina and Manzini Index or Suffix Arrays. All these backward search algorithms have excellent performance, but their computational cost highly increases when allowing errors. In this paper, we discuss an inexact mapping algorithm based on pruning strategies for search tree exploration over genomic data.
The proposed algorithm achieves a 13x speed-up over similar algorithms when allowing 6 base errors, including insertions, deletions and mismatches. This algorithm can deal with 400 bps reads with up to 9 errors in a high quality Illumina dataset. In this example, the algorithm works as a preprocessor that reduces by 55% the number of reads to be aligned. Depending on the aligner the overall execution time is reduced between 20-40%.
Although not intended as a complete sequence mapping tool, the proposed algorithm could be used as a preprocessing step to modern sequence mappers. This step significantly reduces the number reads to be aligned, accelerating overall alignment time. Furthermore, this algorithm could be used for accelerating the seeding step of already available sequence mappers. In addition, an out-of-core index has been implemented for working with large genomes on systems without expensive memory configurations.
用于下一代测序的短序列映射方法是由种子技术与基于动态规划方法的局部比对相结合组成。大多数种子算法基于反向搜索比对,使用布罗伊登-弗莱彻-戈德法布-香农算法(Burrows Wheeler Transform)、费拉吉纳和曼齐尼索引(Ferragina and Manzini Index)或后缀数组(Suffix Arrays)。所有这些反向搜索算法都具有出色的性能,但在允许错误时其计算成本会大幅增加。在本文中,我们讨论一种基于剪枝策略的不精确映射算法,用于对基因组数据进行搜索树探索。
当允许6个碱基错误(包括插入、缺失和错配)时,所提出的算法比类似算法的速度提高了13倍。该算法可以处理高质量Illumina数据集中长度为400个碱基对且最多有9个错误的读段。在这个例子中,该算法作为预处理器,将需要比对的读段数量减少了55%。根据比对器的不同,总体执行时间减少了20% - 40%。
尽管该算法并非旨在作为一个完整的序列映射工具,但可作为现代序列映射器的预处理步骤。这一步骤显著减少了需要比对的读段数量,加快了总体比对时间。此外,该算法可用于加速现有序列映射器的种子步骤。此外,还实现了一种核外索引,以便在没有昂贵内存配置的系统上处理大型基因组。