一种用于近似串联重复序列的算法。

An algorithm for approximate tandem repeats.

作者信息

Landau G M, Schmidt J P, Sokol D

机构信息

Department of Computer Science, Haifa University, Haifa 31905, Israel.

出版信息

J Comput Biol. 2001;8(1):1-18. doi: 10.1089/106652701300099038.

DOI:10.1089/106652701300099038

PMID:11339903

Abstract

A perfect single tandem repeat is defined as a nonempty string that can be divided into two identical substrings, e.g., abcabc. An approximate single tandem repeat is one in which the substrings are similar, but not identical, e.g., abcdaacd. In this paper we consider two criterions of similarity: the Hamming distance (k mismatches) and the edit distance (k differences). For a string S of length n and an integer k our algorithm reports all locally optimal approximate repeats, r = umacro û, for which the Hamming distance of umacro and û is at most k, in O(nk log (n/k)) time, or all those for which the edit distance of umacro and û is at most k, in O(nk log k log (n/k)) time. This paper concentrates on a more general type of repeat called multiple tandem repeats. A multiple tandem repeat in a sequence S is a (periodic) substring r of S of the form r = u(a)u', where u is a prefix of r and u' is a prefix of u. An approximate multiple tandem repeat is a multiple repeat with errors; the repeated subsequences are similar but not identical. We precisely define approximate multiple repeats, and present an algorithm that finds all repeats that concur with our definition. The time complexity of the algorithm, when searching for repeats with up to k errors in a string S of length n, is O(nka log (n/k)) where a is the maximum number of periods in any reported repeat. We present some experimental results concerning the performance and sensitivity of our algorithm. The problem of finding repeats within a string is a computational problem with important applications in the field of molecular biology. Both exact and inexact repeats occur frequently in the genome, and certain repeats occurring in the genome are known to be related to diseases in the human.

摘要

完美单串联重复被定义为一个非空字符串，它可以被分成两个相同的子串，例如abcabc。近似单串联重复是指子串相似但不相同的情况，例如abcdaacd。在本文中，我们考虑两种相似性标准：汉明距离（k个错配）和编辑距离（k个差异）。对于长度为n的字符串S和整数k，我们的算法在O(nk log (n/k))时间内报告所有局部最优近似重复r = umacro û，其中umacro和û的汉明距离至多为k，或者在O(nk log k log (n/k))时间内报告所有umacro和û的编辑距离至多为k的情况。本文专注于一种更一般类型的重复，称为多重串联重复。序列S中的多重串联重复是S的一个（周期性）子串r，形式为r = u(a)u'，其中u是r的前缀，u'是u的前缀。近似多重串联重复是带有错误的多重重复；重复的子序列相似但不相同。我们精确地定义了近似多重重复，并提出了一种算法来找到所有符合我们定义的重复。当在长度为n的字符串S中搜索至多有k个错误的重复时，该算法的时间复杂度为O(nka log (n/k))，其中a是任何报告的重复中周期的最大数量。我们展示了一些关于我们算法性能和灵敏度的实验结果。在字符串中寻找重复的问题是一个在分子生物学领域有重要应用的计算问题。精确和不精确的重复在基因组中都经常出现，并且已知基因组中出现的某些重复与人类疾病有关。