May A C
Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, UK.
Protein Eng. 2001 Apr;14(4):209-17. doi: 10.1093/protein/14.4.209.
Hierarchical classification is probably the most popular approach to grouping related proteins. However, there are a number of problems associated with its use for this purpose. One is that the resulting tree, which shows a nested sequence of groups, may not be the most suitable representation of the data. Another is that visual inspection is the most common method of deciding the most appropriate number of subsets from a tree. In fact, classification of proteins in general is bedevilled by the need for subjective thresholds to define group membership (e.g., 'significant' sequence identity for homologous families). Such arbitrariness is not only intellectually unsatisfying but also has important practical consequences. For instance, it hinders meaningful identification of protein targets for structural genomics. I describe an alternative approach to clustering related proteins that needs no a priori threshold and that, through its use of dynamic programming, is guaranteed to produce globally optimal solutions at all levels of partition granularity. Grouping proteins according to weights assigned to their aligned sequences makes it possible to delineate dynamically a 'core-periphery' structure within families. The 'core' of a protein family comprises the most typical sequences while the 'periphery' consists of the atypical ones. Further, a new sequence weighting scheme is put forward that combines the information in all the multiply aligned positions of an alignment in a novel way. Instead of averaging over all positions, this procedure takes into account directly the distribution of sequence variability along an alignment. The relationships between sequence weights and sequence identity are investigated for 168 families taken from HOMSTRAD, a database of protein structure alignments for homologous families. An exact solution is presented for the problem of how to select the most representative pair of sequences for a protein family. Extension of this approach by a greedy algorithm allows automatic identification of a minimal set of aligned sequences. The results of this analysis are available on the Web at http://mathbio.nimr.mrc.ac.uk/~amay.
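The idea of threshold-free clustering by dynamic programming can be illustrated with a minimal sketch. The abstract does not state the objective function or the recurrence used in the paper, so everything below is an assumption: each sequence is reduced to a single weight, groups are contiguous runs of the sorted weights, and the cost of a partition is its total within-group sum of squared deviations. The function name `optimal_partitions` is hypothetical. Under those assumptions, one table yields a globally optimal partition for every number of groups, with no cutoff to choose in advance.

```python
from typing import Dict, List, Tuple


def optimal_partitions(
    weights: List[float], max_groups: int
) -> Dict[int, Tuple[float, List[List[float]]]]:
    """Return {k: (cost, groups)} for k = 1..max_groups.

    Assumed objective (not taken from the paper): total within-group
    sum of squared deviations over contiguous runs of sorted weights.
    """
    xs = sorted(weights)
    n = len(xs)
    max_groups = min(max_groups, n)

    # Prefix sums give the squared-deviation cost of xs[i:j] in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for idx, x in enumerate(xs):
        s1[idx + 1] = s1[idx] + x
        s2[idx + 1] = s2[idx] + x * x

    def sse(i: int, j: int) -> float:
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[k][j]: minimal cost of splitting xs[:j] into exactly k groups.
    # cut[k][j]: start index of the last group in that optimal split.
    best = [[inf] * (n + 1) for _ in range(max_groups + 1)]
    cut = [[0] * (n + 1) for _ in range(max_groups + 1)]
    best[0][0] = 0.0
    for k in range(1, max_groups + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                cost = best[k - 1][i] + sse(i, j)
                if cost < best[k][j]:
                    best[k][j] = cost
                    cut[k][j] = i

    # Walk the cut table backwards to recover the groups for each k.
    result = {}
    for k in range(1, max_groups + 1):
        groups, j = [], n
        for g in range(k, 0, -1):
            i = cut[g][j]
            groups.append(xs[i:j])
            j = i
        result[k] = (best[k][n], groups[::-1])
    return result


if __name__ == "__main__":
    # Made-up weights, purely for illustration.
    for k, (cost, groups) in optimal_partitions([10.0, 1.0, 20.0, 2.0, 11.0], 3).items():
        print(k, cost, groups)
```

Because the table is filled for every group count up to `max_groups`, the partitions at different granularities can be compared directly instead of being read off a tree by eye. The triple loop costs O(k n²) time, which is adequate for family-sized inputs in a sketch like this.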