ProClust: improved clustering of protein sequences with an extended graph-based approach.

Authors

Pipenbacher P, Schliep A, Schneckener S, Schönhuth A, Schomburg D, Schrader R

Affiliation

ZAIK/ZPR, Universität zu Köln, Germany.

Publication information

Bioinformatics. 2002;18 Suppl 2:S182-91. doi: 10.1093/bioinformatics/18.suppl_2.s182.

DOI: 10.1093/bioinformatics/18.suppl_2.s182
PMID: 12386002

Abstract

MOTIVATION

The problem of finding remote homologues of a given protein sequence via alignment methods is not fully solved. In fact, the task seems to become more difficult with more data. As the size of the database increases, so does the noise level; the highest alignment scores due to random similarities increase and can be higher than the alignment score between true homologues. Comparing two sequences with an arbitrary alignment method yields a similarity value which may indicate an evolutionary relationship between them. A threshold value is usually chosen to distinguish between true homologue relationships and random similarities. To compensate for the higher probability of spurious hits in larger databases, this threshold is increased. Increasing specificity however leads to decreased sensitivity as a matter of principle. Sensitivity can be recovered by utilizing refined protocols. A number of approaches to this challenge have made use of the fact that proteins are often members of some larger protein family. This can be exploited by using position-specific substitution matrices or profiles, or by making use of transitivity of homology. Transitivity refers to the concept of concluding homology between proteins A and C based on homology between A and a third protein B and between B and C. It has been demonstrated that transitivity can lead to substantial improvement in recognition of remote homologues particularly in cases where the alignment score of A and C is below the noise level. A natural limit to the use of transitivity is imposed by domains. Domains, compact independent sub-units of proteins, are often shared between otherwise distinct proteins, and can cause substantial problems by incorrectly linking otherwise unrelated proteins.
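
The transitivity idea described above can be illustrated with a small sketch. It is not the authors' implementation: it simply treats every pair of sequences whose alignment score reaches a threshold as linked, then follows chains of links while counting intermediate sequences. The function name, the input format, the scores, and the threshold are all hypothetical and chosen only for illustration.

```python
from collections import deque


def transitive_homologues(scores, query, threshold, max_intermediates):
    """Return {protein: number of intermediates} for proteins linked to `query`.

    scores: dict mapping (a, b) tuples to a pairwise alignment score
            (hypothetical input format, not taken from the paper).
    A pair counts as linked when its score is at least `threshold`.
    """
    # Build an undirected adjacency list from the pairs that pass the threshold.
    neighbours = {}
    for (a, b), score in scores.items():
        if score >= threshold:
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)

    # Breadth-first search: a node at depth d passes d intermediates
    # on to the proteins discovered through it.
    found = {}
    seen = {query}
    queue = deque([(query, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth > max_intermediates:
            continue  # going further would exceed the allowed chain length
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                found[nxt] = depth
                queue.append((nxt, depth + 1))
    return found


# Toy data: A and C score below the threshold directly, but B links them.
toy_scores = {("A", "B"): 120, ("B", "C"): 95, ("A", "C"): 30, ("C", "D"): 90}
direct_only = transitive_homologues(toy_scores, "A", threshold=80, max_intermediates=0)
one_step = transitive_homologues(toy_scores, "A", threshold=80, max_intermediates=1)
```

With no intermediates allowed, only B is found from A. Allowing one intermediate adds C through B, even though the direct A-C score is below the threshold. The bound on chain length is what keeps D, which needs two intermediates, out of the result.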

RESULTS

We extend a graph-based clustering algorithm which uses an asymmetric distance measure, scaling similarity values based on the length of the protein sequences compared. Additionally, the significance of alignment scores is taken into account and used for a filtering step in the algorithm. Post-processing to merge further clusters based on profile HMMs is proposed. SCOP sequences and their super-family level classification are used as a test set for a clustering computed with our method for the joint data set containing both SCOP and SWISS-PROT. Note that the joint data set includes all multi-domain proteins which contain the SCOP domains and are therefore a potential source of incorrect links. At high specificities, our method compares very favorably with PSI-Blast, which is probably the most widely used tool for finding remote homologues. We demonstrate that using transitivity with as many as twelve intermediate sequences is crucial to achieving this level of performance. Moreover, from analysis of false positives we conclude that our method seems to correctly bound the degree of transitivity used. This analysis also yields explicit guidance in choosing parameters. The heuristics of the asymmetric distance measure used neither solve the multi-domain problem from a theoretical point of view, nor do they avoid all types of problems we have observed in real data. Nevertheless, they do provide a substantial improvement over existing approaches.
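
The abstract does not give the formula for the asymmetric measure, so the following is only a sketch of one plausible reading. It assumes that the raw score of a pair is divided by the self-alignment score of one of the two sequences (a quantity that grows with sequence length), that a directed edge is drawn when this ratio reaches a cutoff, and that clusters are groups of sequences that can reach each other along directed edges. All names, scores, and the cutoff are hypothetical. The significance filter and the profile-HMM post-processing are not modelled.

```python
def scaled_score(raw_score, self_score):
    """Scale a raw alignment score by one sequence's self-alignment score.

    Assumed form of the length scaling; the abstract does not state it.
    """
    return raw_score / self_score


def build_directed_graph(raw, self_scores, cutoff):
    """Add edge a -> b when raw(a, b) scaled by a's self-score reaches `cutoff`."""
    graph = {name: set() for name in self_scores}
    for (a, b), score in raw.items():
        if scaled_score(score, self_scores[a]) >= cutoff:
            graph[a].add(b)
        if scaled_score(score, self_scores[b]) >= cutoff:
            graph[b].add(a)
    return graph


def reachable(graph, start):
    """Return every node reachable from `start` along directed edges."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def mutual_clusters(graph):
    """Group nodes that can reach each other (simple quadratic version)."""
    reach = {node: reachable(graph, node) for node in graph}
    clusters, assigned = [], set()
    for node in sorted(graph):
        if node in assigned:
            continue
        cluster = {other for other in reach[node] if node in reach[other]}
        clusters.append(cluster)
        assigned |= cluster
    return clusters


# Toy data: P1 and P2 are similar single-domain proteins; M is a long
# multi-domain protein sharing one domain with P1 and another with Q.
self_scores = {"P1": 200, "P2": 210, "M": 600, "Q": 220}
raw = {("P1", "P2"): 150, ("P1", "M"): 140, ("Q", "M"): 150}
graph = build_directed_graph(raw, self_scores, cutoff=0.5)
clusters = mutual_clusters(graph)
```

In this toy example the short proteins point to the long protein M, but M does not point back, because the shared domain covers only a small part of M. P1 and P2 therefore form a cluster, while M and Q stay separate. A symmetric threshold on the raw scores would instead have chained P1, M, and Q together, which is the kind of incorrect link through a shared domain that the abstract describes.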

AVAILABILITY

The complete software source is freely available to all users under the GNU General Public License (GPL) from http://www.bioinformatik.uni-koeln.de/~proclust/download/
