使用后缀树发现DNA序列中的复杂重复模式。

Using suffix tree to discover complex repetitive patterns in DNA sequences.

作者信息

He Dan

机构信息

Dept. of Comput. Sci., Vermont Univ., Burlington, VT 05405, USA.

出版信息

Conf Proc IEEE Eng Med Biol Soc. 2006;2006:3474-7. doi: 10.1109/IEMBS.2006.260445.

DOI:10.1109/IEMBS.2006.260445

PMID:17945779

Abstract

The discovery of repetitive patterns is a fundamental problem in bioinformatics. It remains a challenging open problem because most of the existing methods, such as using annotated repeat database and extracting pairs of maximum repeated regions, can not give a correct definition incorporating both the length and frequency factors of the repetitive patterns. There is an algorithm considering both the pattern length and frequency. However, it could only find the simple "elementary" repeats and is not able to reveal the complex structure of the repetitive patterns. Furthermore, its time complexity O(n2 f), where n is the length of the sequence, f is the minimum frequency requirement, could be still too high for long DNA sequences. In this paper, we propose a novel algorithm using suffix tree to reveal the complex structure of the repetitive patterns in DNA sequences. We show that our algorithm achieves an O(n2/f2) time complexity.

摘要

重复模式的发现是生物信息学中的一个基本问题。它仍然是一个具有挑战性的开放问题，因为大多数现有方法，如使用带注释的重复数据库和提取最大重复区域对，无法给出一个包含重复模式的长度和频率因素的正确定义。有一种算法同时考虑了模式长度和频率。然而，它只能找到简单的“基本”重复，无法揭示重复模式的复杂结构。此外，其时间复杂度为O(n2 f)，其中n是序列的长度，f是最小频率要求，对于长DNA序列来说可能仍然过高。在本文中，我们提出了一种使用后缀树来揭示DNA序列中重复模式复杂结构的新算法。我们表明，我们的算法实现了O(n2/f2)的时间复杂度。