Hannenhalli S S, Russell R B
Bioinformatics Research Group, SmithKline Beecham Pharmaceuticals Research & Development, 709 Swedeland Road, King of Prussia, PA 19406, USA.
J Mol Biol. 2000 Oct 13;303(1):61-76. doi: 10.1006/jmbi.2000.4036.
The increasing number and diversity of protein sequence families require new methods to define and predict details regarding function. Here, we present a method for analysis and prediction of functional sub-types from multiple protein sequence alignments. Given an alignment and a set of proteins grouped into sub-types according to some definition of function, such as enzymatic specificity, the method identifies positions that are indicative of functional differences by comparison of sub-type specific sequence profiles, and analysis of positional entropy in the alignment. Alignment positions with significantly high positional relative entropy correlate with those known to be involved in defining sub-types for nucleotidyl cyclases, protein kinases, lactate/malate dehydrogenases and trypsin-like serine proteases. We highlight new positions for these proteins that suggest additional experiments to elucidate the basis of specificity. The method is also able to predict the sub-type of unclassified sequences. We assess several variations on a prediction method, and compare them to simple sequence comparisons. For assessment, we remove close homologues of the sequence for which a prediction is to be made (those with a sequence identity above a threshold). This simulates situations where a protein is known to belong to a protein family, but is not a close relative of another protein of known sub-type. Considering the four families above, and a sequence identity threshold of 30%, our best method gives an accuracy of 96% compared to 80% obtained for sequence similarity and 74% for BLAST. We describe the derivation of a set of sub-type groupings from an automated parsing of alignments from PFAM and the SWISSPROT database, and use this to perform a large-scale assessment. The best method gives an average accuracy of 94% compared to 68% for sequence similarity and 79% for BLAST. We discuss implications for experimental design, genome annotation and the prediction of protein function and protein intra-residue distances.
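The abstract describes scoring alignment positions by comparing sub-type specific profiles through relative entropy, but does not give the formula. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a symmetric Kullback-Leibler divergence between each sub-type's column profile and the profile of all other sequences, summed over sub-types, with a uniform pseudocount. The function names, the pseudocount value and the handling of gaps (ignored) are all illustrative assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(residues, pseudocount=1.0):
    """Residue frequencies for one alignment column.

    Gaps and non-standard symbols are ignored (an assumption of this sketch).
    A uniform pseudocount keeps every frequency non-zero so the logarithm
    in the relative entropy is always defined.
    """
    counts = Counter(r for r in residues if r in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(p[a] * math.log2(p[a] / q[a]) for a in AMINO_ACIDS)


def positional_relative_entropy(alignment, subtypes, pseudocount=1.0):
    """Score each alignment column by how well it separates the sub-types.

    alignment: list of aligned sequences, all of the same length.
    subtypes:  list of sub-type labels, one per sequence.
    Returns one score per column: the sum, over sub-types, of the symmetric
    relative entropy between that sub-type's profile and the profile of all
    remaining sequences (an assumed scoring scheme).
    """
    if len(alignment) != len(subtypes):
        raise ValueError("one sub-type label is required per sequence")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    labels = sorted(set(subtypes))
    scores = []
    for col in range(length):
        score = 0.0
        for label in labels:
            inside = [s[col] for s, l in zip(alignment, subtypes) if l == label]
            outside = [s[col] for s, l in zip(alignment, subtypes) if l != label]
            p = column_profile(inside, pseudocount)
            q = column_profile(outside, pseudocount)
            score += relative_entropy(p, q) + relative_entropy(q, p)
        scores.append(score)
    return scores
```

Columns that are conserved within each sub-type but differ between sub-types receive the highest scores under this scheme, while columns that are identical across all sequences score zero. Deciding which scores count as significantly high would need a background model or threshold that this sketch does not attempt.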