Strope Pooja K, Moriyama Etsuko N
Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588-0660
Genomics. 2007 May;89(5):602-12. doi: 10.1016/j.ygeno.2007.01.008. Epub 2007 Mar 2.
Computational methods for predicting protein functions rely on detecting similarities among proteins. However, sufficient sequence information is not always available for some protein families. For example, proteins of interest may be new members of a divergent protein family. The performance of protein classification methods can vary in such challenging situations. Using the G-protein-coupled receptor superfamily as an example, we investigated the performance of several protein classifiers. Alignment-free classifiers based on support vector machines using simple amino acid compositions were effective in remote-similarity detection, even from short fragmented sequences. Although computationally expensive, a support vector machine classifier using local pairwise alignment scores showed very good balanced performance. The more commonly used profile hidden Markov models were generally highly specific and well suited to classifying well-established protein family members. We suggest that different types of protein classifiers be applied to gain optimal mining power.
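As a rough illustration of the "simple amino acid compositions" mentioned above, the following minimal sketch computes a 20-dimensional composition vector from a protein sequence. The function name `amino_acid_composition` and the example fragment are hypothetical and chosen only for illustration; this is not the authors' implementation, and the support vector machine that would consume such vectors is omitted.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so that every
# sequence maps to a feature vector with the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> list[float]:
    """Return the fraction of each standard amino acid in `sequence`.

    Non-standard symbols (e.g. X, gaps) are ignored. No alignment is
    needed, so the function also works on short fragments.
    """
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    total = len(residues)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical short fragment, made up for this example.
fragment = "MNGTEGPNFYVPFSNK"
features = amino_acid_composition(fragment)
```

Because the vector length is fixed at 20 regardless of sequence length, fragments and full-length sequences can be fed to the same classifier without any alignment step.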