Graph-based clustering for finding distant relationships in a large set of protein sequences.

Authors

Kawaji Hideya, Takenaka Yoichi, Matsuda Hideo

Affiliation

Department of Bioinformatic Engineering, Graduate School of Information Science and Technology, Osaka University, 1-3 Machikaneyama, Toyonaka, Osaka 560-8531, Japan.

Publication information

Bioinformatics. 2004 Jan 22;20(2):243-52. doi: 10.1093/bioinformatics/btg397.

DOI: 10.1093/bioinformatics/btg397
PMID: 14734316
Abstract

MOTIVATION

Clustering of protein sequences is widely used for the functional characterization of proteins. However, it is still not easy to cluster distantly-related proteins, which have only regional similarity among their sequences. It is therefore necessary to develop an algorithm for clustering such distantly-related proteins.

RESULTS

We have developed a time- and space-efficient clustering algorithm. It uses a graph representation in which vertices denote proteins and edges denote sequence similarities above a certain cutoff score. It repeatedly partitions the graph by removing edges that have small weights, which correspond to low sequence similarities. To find the appropriate partitions, we introduce a score that combines the normalized cut and locally minimal cut capacities. We applied our method to all 40,703 human proteins in SWISS-PROT and TrEMBL. The resulting clusters recall 76% (20,529) of the 26,917 proteins classified by InterPro. The method also finds relationships not found by other clustering methods.
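
The graph representation and the normalized cut mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact scoring function (which also involves locally minimal cut capacities), so the code below only computes the textbook normalized cut of a given two-way partition. The protein names, similarity scores, cutoff value, and function names are all hypothetical.

```python
from collections import defaultdict


def build_graph(similarities, cutoff):
    """Build an undirected weighted graph, keeping only pairs whose
    similarity score is above the cutoff (hypothetical helper)."""
    graph = defaultdict(dict)
    for a, b, score in similarities:
        if score > cutoff:
            graph[a][b] = score
            graph[b][a] = score
    return graph


def normalized_cut(graph, part_a, part_b):
    """Textbook normalized cut of a bipartition (A, B):
    cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V).
    This is an assumption standing in for the paper's combined score."""
    part_a, part_b = set(part_a), set(part_b)
    # Total weight of edges crossing from A to B.
    cut = sum(w for u in part_a for v, w in graph[u].items() if v in part_b)
    # Total weight of edges incident to each side.
    assoc_a = sum(w for u in part_a for w in graph[u].values())
    assoc_b = sum(w for u in part_b for w in graph[u].values())
    if assoc_a == 0 or assoc_b == 0:
        return float("inf")
    return cut / assoc_a + cut / assoc_b


# Toy data: two tight groups joined by one weak edge (made-up scores).
similarities = [
    ("P1", "P2", 90), ("P2", "P3", 85), ("P1", "P3", 80),
    ("P4", "P5", 95),
    ("P3", "P4", 35),   # weak link between the two groups
    ("P5", "P6", 20),   # below the cutoff, so it is dropped
]
graph = build_graph(similarities, cutoff=30)

good = ({"P1", "P2", "P3"}, {"P4", "P5"})   # cuts only the weak edge
bad = ({"P1"}, {"P2", "P3", "P4", "P5"})    # cuts two strong edges

print(normalized_cut(graph, *good))
print(normalized_cut(graph, *bad))
```

In this toy graph the partition that removes only the low-weight edge scores lower than the one that separates a protein from its close relatives, which is the intuition behind partitioning by removing edges with small weights.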

AVAILABILITY

The complete result of our algorithm for all the human proteins in SWISS-PROT and TrEMBL, and other supplementary information are available at http://motif.ics.es.osaka-u.ac.jp/Ncut-KL/

Similar articles

1
Graph-based clustering for finding distant relationships in a large set of protein sequences.
Bioinformatics. 2004 Jan 22;20(2):243-52. doi: 10.1093/bioinformatics/btg397.
2
A graph-based clustering method for a large set of sequences using a graph partitioning algorithm.
Genome Inform. 2001;12:93-102.
3
ProClust: improved clustering of protein sequences with an extended graph-based approach.
Bioinformatics. 2002;18 Suppl 2:S182-91. doi: 10.1093/bioinformatics/18.suppl_2.s182.
4
The metric space of proteins-comparative study of clustering algorithms.
Bioinformatics. 2002;18 Suppl 1:S14-21. doi: 10.1093/bioinformatics/18.suppl_1.s14.
5
Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space.
Bioinformatics. 2008 Jul 1;24(13):i41-9. doi: 10.1093/bioinformatics/btn174.
6
Cross-over between discrete and continuous protein structure space: insights into automatic classification and networks of protein structures.
PLoS Comput Biol. 2009 Mar;5(3):e1000331. doi: 10.1371/journal.pcbi.1000331. Epub 2009 Mar 27.
7
Clustering protein sequences--structure prediction by transitive homology.
Bioinformatics. 2001 Oct;17(10):935-41. doi: 10.1093/bioinformatics/17.10.935.
8
Measuring the similarity of protein structures by means of the universal similarity metric.
Bioinformatics. 2004 May 1;20(7):1015-21. doi: 10.1093/bioinformatics/bth031. Epub 2004 Jan 29.
9
Euclidian space and grouping of biological objects.
Bioinformatics. 2002 Nov;18(11):1523-34. doi: 10.1093/bioinformatics/18.11.1523.
10
Clustering of proximal sequence space for the identification of protein families.
Bioinformatics. 2002 Jul;18(7):908-21. doi: 10.1093/bioinformatics/18.7.908.

Cited by

1
A new clustering method based on multipartite networks.
PeerJ Comput Sci. 2023 Oct 13;9:e1621. doi: 10.7717/peerj-cs.1621. eCollection 2023.
2
Ranking and compacting binding segments of protein families using aligned pattern clusters.
Proteome Sci. 2013 Nov 7;11(Suppl 1):S8. doi: 10.1186/1477-5956-11-S1-S8.
3
Automatic classification of protein structures relying on similarities between alignments.
BMC Bioinformatics. 2012 Sep 14;13:233. doi: 10.1186/1471-2105-13-233.
4
Objective sequence-based subfamily classifications of mouse homeodomains reflect their in vitro DNA-binding preferences.
Nucleic Acids Res. 2010 Dec;38(22):7927-42. doi: 10.1093/nar/gkq714. Epub 2010 Aug 12.
5
Large scale hierarchical clustering of protein sequences.
BMC Bioinformatics. 2005 Jan 22;6:15. doi: 10.1186/1471-2105-6-15.