ProClust: improved clustering of protein sequences with an extended graph-based approach.

Authors

Pipenbacher P, Schliep A, Schneckener S, Schönhuth A, Schomburg D, Schrader R

Affiliation

ZAIK/ZPR, Universität zu Köln, Germany.

Publication

Bioinformatics. 2002;18 Suppl 2:S182-91. doi: 10.1093/bioinformatics/18.suppl_2.s182.

Abstract

MOTIVATION

The problem of finding remote homologues of a given protein sequence via alignment methods is not fully solved. In fact, the task seems to become more difficult with more data. As the size of the database increases, so does the noise level; the highest alignment scores due to random similarities increase and can be higher than the alignment score between true homologues. Comparing two sequences with an arbitrary alignment method yields a similarity value which may indicate an evolutionary relationship between them. A threshold value is usually chosen to distinguish between true homologue relationships and random similarities. To compensate for the higher probability of spurious hits in larger databases, this threshold is increased. Increasing specificity, however, leads to decreased sensitivity as a matter of principle. Sensitivity can be recovered by utilizing refined protocols. A number of approaches to this challenge have made use of the fact that proteins are often members of some larger protein family. This can be exploited by using position-specific substitution matrices or profiles, or by making use of transitivity of homology. Transitivity refers to the concept of concluding homology between proteins A and C based on homology between A and a third protein B and between B and C. It has been demonstrated that transitivity can lead to substantial improvement in recognition of remote homologues, particularly in cases where the alignment score of A and C is below the noise level. A natural limit to the use of transitivity is imposed by domains. Domains, compact independent sub-units of proteins, are often shared between otherwise distinct proteins, and can cause substantial problems by incorrectly linking otherwise unrelated proteins.
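
The idea of transitivity described above can be illustrated with a minimal sketch. It is not taken from the ProClust software: the function name `transitive_homologues`, the `links` mapping and the `max_intermediates` parameter are hypothetical, and the pairwise links are assumed to have already passed some significance threshold.

```python
from collections import deque


def transitive_homologues(links, query, max_intermediates):
    """Collect proteins linked to `query` through chains of pairwise hits.

    `links` maps each protein to the set of proteins it has a significant
    pairwise alignment with (hypothetical input format). The result maps
    every protein found to the number of intermediate sequences on the
    shortest chain from `query`; a direct hit has 0 intermediates.
    """
    depth = {query: 0}
    queue = deque([query])
    while queue:
        current = queue.popleft()
        if depth[current] > max_intermediates:
            continue  # extending this chain would exceed the allowed length
        for neighbour in links.get(current, ()):
            if neighbour not in depth:
                depth[neighbour] = depth[current] + 1
                queue.append(neighbour)
    return {p: d - 1 for p, d in depth.items() if p != query}


# A is linked to B, B to C, and C to D; A and C have no direct hit.
example_links = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
}
direct_only = transitive_homologues(example_links, "A", 0)
one_step = transitive_homologues(example_links, "A", 1)
```

With no intermediates only B is found from A; allowing one intermediate also recovers C through B. Bounding `max_intermediates` is one simple way to limit how far a chain may extend, which matters because a shared domain can connect otherwise unrelated proteins.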

RESULTS

We extend a graph-based clustering algorithm which uses an asymmetric distance measure, scaling similarity values based on the length of the protein sequences compared. Additionally, the significance of alignment scores is taken into account and used for a filtering step in the algorithm. Post-processing to merge further clusters based on profile HMMs is proposed. SCOP sequences and their super-family level classification are used as a test set for a clustering computed with our method for the joint data set containing both SCOP and SWISS-PROT. Note that the joint data set includes all multi-domain proteins, which contain the SCOP domains that are a potential source of incorrect links. At high specificities, our method compares very favorably with PSI-Blast, which is probably the most widely used tool for finding remote homologues. We demonstrate that using transitivity with as many as twelve intermediate sequences is crucial to achieving this level of performance. Moreover, from analysis of false positives we conclude that our method seems to correctly bound the degree of transitivity used. This analysis also yields explicit guidance in choosing parameters. The heuristics of the asymmetric distance measure used neither solve the multi-domain problem from a theoretical point of view, nor do they avoid all types of problems we have observed in real data. Nevertheless, they do provide a substantial improvement over existing approaches.
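
The abstract does not give the formula for the asymmetric measure or the clustering rule, so the following is only a rough sketch under stated assumptions: the raw alignment score is divided by the length of the source sequence (an assumed scaling), hits above an E-value cutoff are filtered out, and clusters are taken as groups of mutually reachable nodes in the resulting directed graph (an assumed rule). All names, thresholds and example values are hypothetical.

```python
def scaled_edges(hits, lengths, min_scaled, max_evalue):
    """Build a directed graph from symmetric pairwise alignment hits.

    hits: iterable of (a, b, raw_score, evalue) tuples.
    An edge src -> dst is kept only if the hit is significant and the
    score scaled by the length of src reaches min_scaled (assumed scaling).
    """
    edges = {name: set() for name in lengths}
    for a, b, score, evalue in hits:
        if evalue > max_evalue:
            continue  # filtering step on alignment significance
        for src, dst in ((a, b), (b, a)):
            if score / lengths[src] >= min_scaled:
                edges[src].add(dst)
    return edges


def reachable(edges, start):
    """Return all nodes reachable from start, including start itself."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def clusters(edges):
    """Group nodes that can reach each other (assumed clustering rule)."""
    reach = {n: reachable(edges, n) for n in edges}
    assigned = set()
    result = []
    for n in sorted(edges):
        if n in assigned:
            continue
        component = {m for m in reach[n] if n in reach[m]}
        assigned |= component
        result.append(sorted(component))
    return result


# Hypothetical data: M is a long multi-domain protein sharing one domain
# with P1 and a different domain with Q.
lengths = {"P1": 100, "P2": 110, "M": 400, "Q": 120}
hits = [
    ("P1", "P2", 80, 1e-20),
    ("P1", "M", 70, 1e-15),
    ("M", "Q", 75, 1e-15),
    ("P2", "Q", 30, 5.0),  # not significant, removed by the filter
]
graph = scaled_edges(hits, lengths, min_scaled=0.5, max_evalue=1e-3)
result = clusters(graph)
```

In this toy data the short proteins point to the long protein M, but M does not point back because its scaled scores are low. P1 and Q are therefore not merged through M, which illustrates how an asymmetric, length-scaled measure might reduce incorrect links through multi-domain proteins.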

AVAILABILITY

The complete software source is freely available to all users under the GNU General Public License (GPL) from http://www.bioinformatik.uni-koeln.de/~proclust/download/
