两种用于查找序列间近似重叠的高效技术。

Two Efficient Techniques to Find Approximate Overlaps between Sequences.

作者信息

Haj Rachid Maan

机构信息

Qatar University, P.O. Box 2713, Doha, Qatar.

出版信息

Biomed Res Int. 2017;2017:2731385. doi: 10.1155/2017/2731385. Epub 2017 Feb 15.

DOI:10.1155/2017/2731385

PMID:28293632

原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC5331309/

Abstract

The next-generation sequencing (NGS) technology outputs a huge number of sequences (reads) that require further processing. After applying prefiltering techniques in order to eliminate redundancy and to correct erroneous reads, an overlap-based assembler typically finds the longest exact suffix-prefix match between each ordered pair of the input reads. However, another trend has been evolving for the purpose of solving an approximate version of the overlap problem. The main benefit of this direction is the ability to skip time-consuming error-detecting techniques which are applied in the prefiltering stage. In this work, we present and compare two techniques to solve the approximate overlap problem. The first adapts a compact prefix tree to efficiently solve the approximate all-pairs suffix-prefix problem, while the other utilizes a well-known principle, namely, the pigeonhole principle, to identify a potential overlap match in order to ultimately solve the same problem. Our results show that our solution using the pigeonhole principle has better space and time consumption over an FM-based solution, while our solution based on prefix tree has the best space consumption between all three solutions. The number of mismatches (hamming distance) is used to define the approximate matching between strings in our work.

摘要

下一代测序（NGS）技术会输出大量需要进一步处理的序列（读段）。在应用预过滤技术以消除冗余并纠正错误读段之后，基于重叠的组装器通常会在输入读段的每一对有序读段之间找到最长的精确后缀-前缀匹配。然而，为了解决重叠问题的近似版本，另一种趋势正在发展。这个方向的主要好处是能够跳过在预过滤阶段应用的耗时的错误检测技术。在这项工作中，我们提出并比较了两种解决近似重叠问题的技术。第一种技术采用紧凑前缀树来高效解决近似全对后缀-前缀问题，而另一种技术利用一个著名的原理，即鸽巢原理，来识别潜在的重叠匹配，以便最终解决相同的问题。我们的结果表明，我们使用鸽巢原理的解决方案在空间和时间消耗方面优于基于FM的解决方案，而我们基于前缀树的解决方案在所有三种解决方案中具有最佳的空间消耗。在我们的工作中，错配数（汉明距离）用于定义字符串之间的近似匹配。