Tracking repeats using significance and transitivity.

Author information

Szklarczyk Radek, Heringa Jaap

Affiliation

Centre for Integrative Bioinformatics (IBIVU), Faculty of Sciences and Faculty of Earth and Life Sciences, Vrije Universiteit Amsterdam, De Boelelaan 1081A, Amsterdam, The Netherlands.

Publication information

Bioinformatics. 2004 Aug 4;20 Suppl 1:i311-7. doi: 10.1093/bioinformatics/bth911.

DOI:10.1093/bioinformatics/bth911

PMID:15262814

Abstract

MOTIVATION

Internal repeats in coding sequences correspond to structural and functional units of proteins. Moreover, duplication of fragments of coding sequences is a known mechanism that facilitates evolution. Identifying repeats is crucial for shedding light on the function and structure of proteins and for explaining their evolutionary past. The task is difficult because many repeats have diverged beyond recognition in the course of evolution.

RESULTS

We introduce a new method, TRUST, for ab initio determination of internal repeats in proteins. It improves prediction quality over alternative state-of-the-art methods. The increased sensitivity and accuracy of the method are achieved by exploiting the concept of transitivity of alignments. Starting from significant local suboptimal alignments, the application of transitivity allows us to (1) identify distant repeat homologues for which no alignments were found; (2) gain confidence about consistently well-aligned regions; and (3) recognize and reduce the contribution of non-homologous repeats. This re-assessment step enables us to derive a virtually noise-free profile representing a generalized repeat with high fidelity. We also obtained superior specificity by employing rigorous statistical testing for self-sequence and profile-sequence alignments. The method was assessed using a database of repeat annotations based on structural superposition. The results show that TRUST is a useful and reliable tool for mining tandem and non-tandem repeats in protein sequence databases, capable of predicting multiple repeat types with varying intervening segments within a single sequence.
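
The idea of alignment transitivity can be illustrated with a toy sketch. This is not the TRUST implementation: the function names (`transitive_support`, `split_by_evidence`), the representation of alignments as pairs of residue positions, and the simple counting scheme are all assumptions made for illustration. The sketch shows how, if position i aligns with j and j aligns with k, the pair (i, k) gains support, which loosely mirrors the three uses of transitivity listed above.

```python
from collections import defaultdict
from itertools import combinations


def transitive_support(aligned_pairs):
    """Count, for each position pair (i, k), how many intermediate
    positions j are aligned to both i and k.

    aligned_pairs: iterable of (i, j) residue positions matched by some
    local self-alignment (hypothetical input format).
    """
    neighbours = defaultdict(set)
    for i, j in aligned_pairs:
        if i != j:
            neighbours[i].add(j)
            neighbours[j].add(i)

    support = defaultdict(int)
    for linked in neighbours.values():
        # Every two positions sharing a common neighbour are linked
        # transitively through that neighbour.
        for i, k in combinations(sorted(linked), 2):
            support[(i, k)] += 1
    return dict(support)


def split_by_evidence(aligned_pairs, support):
    """Partition pairs into confirmed, inferred, and unsupported sets."""
    direct = {tuple(sorted(p)) for p in aligned_pairs}
    # Directly aligned and also backed by a transitive path.
    confirmed = {p for p in direct if support.get(p, 0) > 0}
    # Never aligned directly, but implied by transitivity.
    inferred = {p for p in support if p not in direct}
    # Directly aligned with no transitive backing: candidates to down-weight.
    unsupported = direct - confirmed
    return confirmed, inferred, unsupported


# Toy example: four putative repeat copies at positions 3, 23, 43 and 63.
pairs = [(3, 23), (23, 43), (3, 43), (43, 63)]
support = transitive_support(pairs)
confirmed, inferred, unsupported = split_by_evidence(pairs, support)
```

In this toy input the copy at position 63 is aligned only with position 43, so its relationship to positions 3 and 23 is recovered purely by transitivity, while the three mutually aligned copies confirm one another. A real method would additionally weight each alignment by its statistical significance, which this sketch omits.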

AVAILABILITY

The TRUST server (together with the source code) is available at http://ibivu.cs.vu.nl/programs/trustwww
