DIVCLUS: an automatic method in the GEANFAMMER package that finds homologous domains in single- and multi-domain proteins.

Authors

Park J, Teichmann S A

Affiliation

MRC Laboratory of Molecular Biology, Cambridge, UK.

Publication

Bioinformatics. 1998;14(2):144-50. doi: 10.1093/bioinformatics/14.2.144.

DOI:10.1093/bioinformatics/14.2.144
PMID:9545446
Abstract

MOTIVATION

Large-scale determination of relationships between the proteins produced by genome sequences is now common. All protein sequences are matched and those that have high match scores are clustered into families. In cases where the proteins are built of several domains or duplication modules, this can lead to misleading results. Consider the very simple example of three proteins: 1, formed by duplication modules A and B; 2, formed by duplication modules B' and C; and 3, formed by duplication modules C' and D. Duplication modules B and B' are homologous, as are C and C'. Matching the sequences of 1, 2 and 3 followed by simple single-linkage clustering would put all three in the same family, even though proteins 1 and 3 are not related. This is because different parts of 2 match 1 and 3. This paper describes a procedure, DIVCLUS, that divides such complex clusters of partially related sequences into simple clusters that contain only related duplication modules. In the example just given, it would produce two groups of sequences: the first with domain B of sequence 1 and domain B' of sequence 2, and the second with domain C of sequence 2 and domain C' of sequence 3. DIVCLUS is part of a package called GEANFAMMER, for GEnome ANalysis and protein FAMily MakER. The package automates the detection of families of duplication modules from a protein sequence database.
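
The three-protein example can be sketched in a few lines of Python. This is not the DIVCLUS implementation (which is a perl5 program); it is a toy illustration that assumes each pairwise match records the matched region on both proteins. The protein names, the coordinates, the `overlaps` rule, and its 0.5 threshold are all hypothetical. Clustering on whole proteins merges 1, 2 and 3, while clustering on matched regions keeps the B and C modules apart.

```python
from collections import defaultdict

# Hypothetical pairwise matches, each given as two (protein, start, end) regions.
# Coordinates are invented for illustration only.
MATCHES = [
    (("P1", 101, 200), ("P2", 1, 100)),    # module B of protein 1 ~ B' of protein 2
    (("P2", 101, 200), ("P3", 1, 100)),    # module C of protein 2 ~ C' of protein 3
]


def connected_components(nodes, edges):
    """Group nodes into connected components (single linkage) via union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return sorted(sorted(g) for g in groups.values())


def overlaps(r1, r2, min_fraction=0.5):
    """True if two regions of the same protein share at least min_fraction
    of the shorter region. The threshold is an assumption of this sketch."""
    if r1[0] != r2[0]:
        return False
    shared = min(r1[2], r2[2]) - max(r1[1], r2[1]) + 1
    shorter = min(r1[2] - r1[1], r2[2] - r2[1]) + 1
    return shared >= min_fraction * shorter


def protein_level_clusters(matches):
    """Single-linkage clustering where any match links two whole proteins."""
    proteins = {region[0] for pair in matches for region in pair}
    edges = [(a[0], b[0]) for a, b in matches]
    return connected_components(proteins, edges)


def region_level_clusters(matches):
    """Single-linkage clustering on matched regions: a match links its two
    regions, and regions of one protein are linked only if they overlap."""
    regions = sorted({region for pair in matches for region in pair})
    edges = list(matches)
    for i, r1 in enumerate(regions):
        for r2 in regions[i + 1:]:
            if overlaps(r1, r2):
                edges.append((r1, r2))
    return connected_components(regions, edges)


if __name__ == "__main__":
    print(protein_level_clusters(MATCHES))  # one family containing P1, P2, P3
    print(region_level_clusters(MATCHES))   # two groups: B/B' and C/C'
```

Because the two matched regions on protein 2 do not overlap, nothing connects the B group to the C group at the region level, which is the separation the abstract describes.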

RESULTS

DIVCLUS has been applied to the division of single-linkage clusters generated from the protein sequences of six completely sequenced bacterial genomes. Out of 12 013 genes in these six genomes, 4563 single- and multi-domain sequences formed 1071 complex clusters. Application of the DIVCLUS program resolved these clusters into 2113 clusters corresponding to single duplication modules.
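
As a quick check on the scale of these figures, the short Python snippet below derives two ratios from the counts quoted above. The variable names are illustrative, and the ratios are simple arithmetic on the reported numbers, not results stated in the paper.

```python
# Counts quoted in the RESULTS paragraph.
total_genes = 12013          # genes in the six bacterial genomes
clustered_sequences = 4563   # sequences that fell into complex clusters
complex_clusters = 1071      # single-linkage clusters before division
module_clusters = 2113       # clusters after division by DIVCLUS

# Share of all genes that ended up in a complex cluster.
fraction_clustered = clustered_sequences / total_genes

# Average number of module clusters produced per complex cluster.
modules_per_cluster = module_clusters / complex_clusters

print(f"{fraction_clustered:.1%} of genes were in complex clusters")
print(f"{modules_per_cluster:.2f} module clusters per complex cluster on average")
```

Roughly 38% of the genes were involved, and each complex cluster was split into just under two module clusters on average.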

AVAILABILITY

The perl5 program and its documentation are available at the following address: http://www.mrc-lmb.cam.ac.uk/genomes/ and by anonymous ftp at ftp.mrc-lmb.cam.ac.uk in the directory /pub/genomes/Software/.

CONTACT

sat@mrc-lmb.cam.ac.uk; jong@mrc-lmb.cam.ac.uk
