slaMEM：使用采样 LCP 数组进行高效的最大精确匹配检索。

slaMEM: efficient retrieval of maximal exact matches using a sampled LCP array.

机构信息

Knowledge Discovery and Bioinformatics Group (KDBIO), Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento (INESC-ID), Rua Alves Redol, 9, 1000-029 Lisbon and Department of Computer Science and Engineering, Instituto Superior Técnico (IST) - Universidade de Lisboa, Avenida Rovisco Pais, 1, 1049-001 Lisbon, Portugal.

出版信息

Bioinformatics. 2014 Feb 15;30(4):464-71. doi: 10.1093/bioinformatics/btt706. Epub 2013 Dec 11.

DOI:10.1093/bioinformatics/btt706

PMID:24336412

Abstract

MOTIVATION

Maximal exact matches, or just MEMs, are a powerful tool in the context of multiple sequence alignment and approximate string matching. The most efficient algorithms to collect them are based on compressed indexes that rely on longest common prefix array-centered data structures. However, their space-efficient representations make use of encoding techniques that are expensive from a computational point of view. With the deluge of data generated by high-throughput sequencing, new approaches need to be developed to deal with larger genomic sequences.

RESULTS

In this work, we have developed a new longest common prefix array-sampled representation, optimized to work with the backward search method inherently used by the FM-Index. Unlike previous implementations that sacrifice running time to have smaller space, ours lead to both a fast and a space-efficient approach. This implementation was used by the new software slaMEM, developed to efficiently retrieve MEMs. The results show that the new algorithm is competitive against existing state-of-the-art approaches.

AVAILABILITY AND IMPLEMENTATION

The software is implemented in C and is operating system independent. The source code is freely available for download at http://github.com/fjdf/slaMEM/ under the GPLv3 license.

摘要

动机

最大精确匹配（MEMs）是多序列比对和近似字符串匹配背景下的强大工具。收集它们的最高效算法基于依赖于最长公共前缀数组中心数据结构的压缩索引。然而，它们节省空间的表示形式利用了从计算角度来看昂贵的编码技术。随着高通量测序产生的数据泛滥，需要开发新的方法来处理更大的基因组序列。

结果

在这项工作中，我们开发了一种新的最长公共前缀数组采样表示，该表示经过优化，可与 FM-Index 固有使用的反向搜索方法配合使用。与以前的实现不同，这些实现牺牲运行时间来获得更小的空间，而我们的实现则同时实现了快速和节省空间的方法。新软件 slaMEM 采用了这种新的实现，用于有效地检索 MEMs。结果表明，新算法具有竞争力，可与现有最先进的方法相媲美。