CNR, Istituto di Informatica e Telematica, Via Moruzzi 1, 56124 Pisa, Italy.
Bioinformatics. 2010 Jun 15;26(12):i358-66. doi: 10.1093/bioinformatics/btq209.
Genomes of higher eukaryotic organisms contain a substantial amount of repeated sequences. Tandem Repeats (TRs) constitute a large class of repetitive sequences that originate via phenomena such as replication slippage and are characterized by close spatial contiguity. They play an important role in several molecular regulatory mechanisms, and also in several diseases (e.g. the group of trinucleotide repeat disorders). While current methods are rather effective for TRs with a low or medium level of divergence, the problem of detecting TRs with higher divergence (fuzzy TRs) is still open. Detecting fuzzy TRs is a prerequisite for enriching our view of their role in regulatory mechanisms and diseases. Fuzzy TRs are also important as tools to shed light on the evolutionary history of the genome, where higher divergence correlates with more remote duplication events.
We have developed an algorithm (christened TRStalker) that aims to efficiently detect TRs that are hard to find because of their inherent fuzziness, which is due to high levels of base substitutions, insertions and deletions. To attain this goal, we developed heuristics to solve a Steiner version of the problem, in which fuzziness is measured with respect to a motif string that is not necessarily present in the input string. This problem is akin to the 'generalized median string' problem, which is known to be NP-hard. Experiments with both synthetic and biological sequences demonstrate that our method performs better than the current state of the art for fuzzy TRs and that fuzzy TRs of the type we detect are indeed present in important biological sequences.
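As a minimal illustration of the Steiner formulation, the Python sketch below scores a candidate motif by the sum of its edit distances to a set of repeat copies. It is not TRStalker's heuristic: the function names `edit_distance` and `total_distance` and the toy copies are assumptions made for this example, and plain Levenshtein distance is assumed as the fuzziness measure.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (substitutions, insertions, deletions), two-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def total_distance(motif: str, copies: list[str]) -> int:
    """Sum of edit distances from a candidate motif to every repeat copy."""
    return sum(edit_distance(motif, c) for c in copies)


# Hypothetical toy data: four copies, each one substitution away from "ACGT".
copies = ["TCGT", "AGGT", "ACTT", "ACGA"]

# Best motif when the motif must be one of the observed copies.
best_in_set = min(total_distance(c, copies) for c in copies)

# A Steiner motif: "ACGT" itself never occurs among the copies.
steiner = total_distance("ACGT", copies)

print(best_in_set, steiner)
```

In this toy case the motif "ACGT" has a lower total distance than any observed copy, which is why measuring fuzziness against a motif outside the input can expose repeats that copy-to-copy comparison scores poorly. Finding the best such motif in general is the generalized median string problem, so exhaustive search over candidates does not scale.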
TRStalker will be integrated in the web-based TRs Discovery Service (TReaDS) at bioalgo.iit.cnr.it.
Supplementary data are available at Bioinformatics online.