Optimal classification of protein sequences and selection of representative sets from multiple alignments: application to homologous families and lessons for structural genomics.

Author

May A C

Affiliation

Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, UK.

Publication

Protein Eng. 2001 Apr;14(4):209-17. doi: 10.1093/protein/14.4.209.

DOI:10.1093/protein/14.4.209
PMID:11391012
Abstract

Hierarchical classification is probably the most popular approach to grouping related proteins. However, there are a number of problems associated with its use for this purpose. One is that the resulting tree, which shows a nested sequence of groups, may not be the most suitable representation of the data. Another is that visual inspection is the most common method of deciding the most appropriate number of subsets from a tree. In fact, classification of proteins in general is bedevilled by the need for subjective thresholds to define group membership (e.g., 'significant' sequence identity for homologous families). Such arbitrariness is not only intellectually unsatisfying but also has important practical consequences. For instance, it hinders meaningful identification of protein targets for structural genomics. I describe an alternative approach to clustering related proteins without the need for an a priori threshold: one that, through its use of dynamic programming, is guaranteed to produce globally optimal solutions at all levels of partition granularity. Grouping proteins according to weights assigned to their aligned sequences makes it possible to delineate dynamically a 'core-periphery' structure within families. The 'core' of a protein family comprises the most typical sequences, while the 'periphery' consists of the atypical ones. Further, a new sequence weighting scheme is put forward that combines the information in all the multiply aligned positions of an alignment in a novel way. Instead of averaging over all positions, this procedure takes into account directly the distribution of sequence variability along an alignment. The relationships between sequence weights and sequence identity are investigated for 168 families taken from HOMSTRAD, a database of protein structure alignments for homologous families. An exact solution is presented for the problem of how to select the most representative pair of sequences for a protein family. Extension of this approach by a greedy algorithm allows automatic identification of a minimal set of aligned sequences. The results of this analysis are available on the Web at http://mathbio.nimr.mrc.ac.uk/~amay.
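
The abstract states that dynamic programming yields globally optimal partitions at every level of granularity, but it does not give the objective function. The sketch below is therefore only an illustration under stated assumptions, not the paper's method: each sequence is reduced to one scalar weight, groups are contiguous runs of the sorted weights, and the cost of a partition is the total within-group sum of squared deviations from the group mean. The function name `optimal_partitions` is hypothetical.

```python
from itertools import accumulate


def optimal_partitions(weights, max_groups):
    """Return {k: (cost, groups)} for k = 1..max_groups.

    Assumption (not from the paper): cost is the within-group sum of
    squared deviations, and groups are contiguous in sorted order.
    """
    xs = sorted(weights)
    n = len(xs)
    if not 1 <= max_groups <= n:
        raise ValueError("max_groups must be between 1 and len(weights)")

    # Prefix sums let us evaluate the cost of any run xs[i:j] in O(1).
    s1 = [0.0] + list(accumulate(xs))
    s2 = [0.0] + list(accumulate(x * x for x in xs))

    def sse(i, j):
        total = s1[j] - s1[i]
        return max(0.0, (s2[j] - s2[i]) - total * total / (j - i))

    inf = float("inf")
    # best[k][j]: minimal cost of splitting xs[:j] into exactly k groups.
    best = [[inf] * (n + 1) for _ in range(max_groups + 1)]
    # cut[k][j]: start index of the last group in that optimal split.
    cut = [[0] * (n + 1) for _ in range(max_groups + 1)]
    best[0][0] = 0.0

    for k in range(1, max_groups + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                cost = best[k - 1][i] + sse(i, j)
                if cost < best[k][j]:
                    best[k][j] = cost
                    cut[k][j] = i

    # Trace back the optimal groups for every k from the same tables.
    result = {}
    for k in range(1, max_groups + 1):
        groups, j = [], n
        for level in range(k, 0, -1):
            i = cut[level][j]
            groups.append(xs[i:j])
            j = i
        result[k] = (best[k][n], groups[::-1])
    return result


if __name__ == "__main__":
    for k, (cost, groups) in optimal_partitions([10, 1, 21, 2, 11, 20], 3).items():
        print(k, cost, groups)
```

Because one pass fills the tables for every `k` up to `max_groups`, the optimal partition at each granularity comes from the same computation, so no similarity threshold has to be fixed in advance. Unlike a hierarchical tree, the optimal solutions for successive values of `k` need not be nested.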

Similar articles

1
Automated protein sequence database classification. I. Integration of compositional similarity search, local similarity search, and multiple sequence alignment.
Bioinformatics. 1998;14(2):164-73. doi: 10.1093/bioinformatics/14.2.164.
2
Definition of the tempo of sequence diversity across an alignment and automatic identification of sequence motifs: Application to protein homologous families and superfamilies.
Protein Sci. 2002 Dec;11(12):2825-35. doi: 10.1110/ps.0211202.
3
On the quality of tree-based protein classification.
Bioinformatics. 2005 May 1;21(9):1876-90. doi: 10.1093/bioinformatics/bti244. Epub 2005 Jan 12.
4
Use of a database of structural alignments and phylogenetic trees in investigating the relationship between sequence and structural variability among homologous proteins.
Protein Eng. 2001 Apr;14(4):219-26. doi: 10.1093/protein/14.4.219.
5
Automated protein sequence database classification. II. Delineation of domain boundaries from sequence similarities.
Bioinformatics. 1998;14(2):174-87. doi: 10.1093/bioinformatics/14.2.174.
6
Using CLUSTAL for multiple sequence alignments.
Methods Enzymol. 1996;266:383-402. doi: 10.1016/s0076-6879(96)66024-8.
7
A map of the protein space--an automatic hierarchical classification of all protein sequences.
Proc Int Conf Intell Syst Mol Biol. 1998;6:212-21.
8
HOMSTRAD: a database of protein structure alignments for homologous families.
Protein Sci. 1998 Nov;7(11):2469-71. doi: 10.1002/pro.5560071126.
9
MACHOS: Markov clusters of homologous subsequences.
Bioinformatics. 2008 Jul 1;24(13):i77-85. doi: 10.1093/bioinformatics/btn144.

Cited by

1
Physicochemical property consensus sequences for functional analysis, design of multivalent antigens and targeted antivirals.
BMC Bioinformatics. 2012;13 Suppl 13(Suppl 13):S9. doi: 10.1186/1471-2105-13-S13-S9. Epub 2012 Aug 24.
2
Simplifying complex sequence information: a PCP-consensus protein binds antibodies against all four Dengue serotypes.
Vaccine. 2012 Sep 14;30(42):6081-7. doi: 10.1016/j.vaccine.2012.07.042. Epub 2012 Jul 31.
3
The relative inefficiency of sequence weights approaches in determining a nucleotide position weight matrix.
Stat Appl Genet Mol Biol. 2005;4:Article13. doi: 10.2202/1544-6115.1135. Epub 2005 Jun 1.
4
Prediction of beta-barrel membrane proteins by searching for restricted domains.
BMC Bioinformatics. 2005 Oct 14;6:254. doi: 10.1186/1471-2105-6-254.
5
A functional hierarchical organization of the protein sequence space.
BMC Bioinformatics. 2004 Dec 14;5:196. doi: 10.1186/1471-2105-5-196.
6
Definition of the tempo of sequence diversity across an alignment and automatic identification of sequence motifs: Application to protein homologous families and superfamilies.
Protein Sci. 2002 Dec;11(12):2825-35. doi: 10.1110/ps.0211202.