Vilim R B, Cunningham R M, Lu B, Kheradpour P, Stevens F J
Argonne National Laboratory, 9700 S. Cass Avenue, Argonne, IL 60439, USA.
Bioinformatics. 2004 Apr 12;20(6):847-53. doi: 10.1093/bioinformatics/btg492. Epub 2004 Feb 5.
Methods that focus on secondary structures, such as Position-Specific Scoring Matrices and Hidden Markov Models, have proved useful for assigning proteins to families. However, for assigning proteins to an attribute class within a family, these methods may introduce more free parameters than are needed, because a family has fewer members and less variability among its sequences. We describe a method for organizing proteins in a family that reduces the number of parameters by up to an order of magnitude. Its basis is the log-odds ratio commonly used to measure similarity, which we adapt to characterize the sequence dissimilarities that give rise to attribute differentiation. This leads to the definition of Class Attribute Substitution Matrices (CLASSUM), a dual of the BLOSUM.
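For orientation, the standard log-odds score that underlies BLOSUM-type matrices can be sketched as follows. This is only an illustration of the generic similarity score, s(a, b) = log2(observed pair frequency / frequency expected by chance); the abstract does not say how the score is adapted to build a CLASSUM, so that step is not shown. The function name `log_odds_scores`, the `pseudocount` parameter, and the toy columns are assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_scores(columns, pseudocount=1.0):
    """Return BLOSUM-style log-odds scores for unordered residue pairs.

    columns: list of strings, each holding the residues observed at one
             aligned position (hypothetical input format).
    pseudocount: added to every pair count so that no ratio is zero
                 (an assumption of this sketch).
    """
    alphabet = sorted({r for col in columns for r in col})
    pairs = [(a, b) for i, a in enumerate(alphabet) for b in alphabet[i:]]

    # Count every unordered residue pair seen within a column.
    counts = Counter({p: pseudocount for p in pairs})
    for col in columns:
        for a, b in combinations(col, 2):
            counts[tuple(sorted((a, b)))] += 1

    total = sum(counts.values())
    q = {p: c / total for p, c in counts.items()}  # observed pair frequencies

    # Marginal residue frequencies implied by the pair frequencies.
    p = {a: 0.0 for a in alphabet}
    for (a, b), f in q.items():
        if a == b:
            p[a] += f
        else:
            p[a] += f / 2
            p[b] += f / 2

    scores = {}
    for (a, b), f in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = math.log2(f / expected)
    return scores


# Toy data: two conserved columns and one mostly conserved column.
toy_scores = log_odds_scores(["AAAA", "BBBB", "AAAB"])
```

In this toy input, identical pairs occur more often than chance predicts and so score positive, while the mixed pair scores negative.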
The method was applied to classify sequences hierarchically in the lambda and kappa subgroups of the immunoglobulin superfamily. Positions conferring class were identified based on the degree of amino acid variability at each position. The CLASSUM computed for these positions correctly classified more than 90% of the test data, compared with 35-50% for BLOSUM-62; the expected value for a random matrix is 14%. The results suggest that family-specific, data-derived substitution matrices can improve the resolution of automated methods that use generic substitution matrices to search for and classify proteins.
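The two steps described above, choosing positions by amino acid variability and then classifying with a substitution score at those positions, can be sketched roughly as follows. Several things here are assumptions rather than the authors' procedure: Shannon entropy as the variability measure, selection of the most variable columns, the mean-score decision rule, the simple match/mismatch score standing in for a real matrix, and the toy four-residue sequences. The function names are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def variable_positions(alignment, top_k):
    """Indices of the top_k highest-entropy columns (assumed selection rule)."""
    n_cols = len(alignment[0])
    entropies = [column_entropy([seq[i] for seq in alignment])
                 for i in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:top_k])


def classify(query, references, positions, score):
    """Assign query to the class whose members give the highest mean score.

    references: dict mapping a class label to a list of aligned sequences.
    score: function (residue, residue) -> number, e.g. a matrix lookup.
    """
    def mean_score(seqs):
        total = sum(score(query[i], s[i]) for s in seqs for i in positions)
        return total / len(seqs)

    return max(references, key=lambda label: mean_score(references[label]))


# Toy example with made-up sequences; labels are illustrative only.
references = {
    "kappa": ["ACDE", "ACDF"],
    "lambda": ["AGDE", "AGDF"],
}
alignment = [s for seqs in references.values() for s in seqs]
positions = variable_positions(alignment, top_k=2)


def match_score(a, b):
    # Stand-in for a substitution-matrix lookup (assumption).
    return 1 if a == b else -1


predicted = classify("AGDE", references, positions, match_score)
```

Restricting the score to a few selected positions is what keeps the parameter count small relative to a full position-specific model; in practice `match_score` would be replaced by a lookup into a data-derived matrix.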