Tao Tao, Zhai Cheng Xiang, Lu Xinghua, Fang Hui
Department of Computer Science, University of Illinois at Urbana-Champaign, 506 S. Matthews Avenue, Urbana, IL 61801, USA.
Appl Bioinformatics. 2004;3(2-3):115-24. doi: 10.2165/00822942-200403020-00006.
Automatic discovery of new protein motifs (i.e. amino acid patterns) is one of the major challenges in bioinformatics. Several algorithms have been proposed that can extract statistically significant motif patterns from any set of protein sequences. These algorithms can generate a large set of candidate motifs that may be biologically meaningful. This article examines methods for predicting the functions of such candidate motifs. We use several statistical methods: a popularity method, a mutual information method and probabilistic translation models. Each captures, from a different perspective, the correlation between the motifs matched by a protein and the Gene Ontology terms assigned to it, which characterise the protein's function. We evaluate the methods on the known motifs in the InterPro database: each method ranks candidate terms for each motif, and performance is measured by the expected mean reciprocal rank. The results show that, in general, all of the methods perform well, suggesting that any of them can be useful for predicting the function of an unknown motif. Among the methods tested, a probabilistic translation model with a popularity prior performs best.
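As a rough illustration of the ranking and evaluation steps described above, the following Python sketch ranks Gene Ontology terms for a motif by co-occurrence count and by mutual information, then scores a ranking with the mean reciprocal rank. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy data and the GO identifiers are hypothetical, "popularity" is read here simply as a co-occurrence count, and the score is the plain mean reciprocal rank rather than the expected variant used in the article. The probabilistic translation models are not covered.

```python
import math
from collections import Counter

# Hypothetical toy data: each protein is (matched motifs, assigned GO terms).
PROTEINS = [
    ({"M1"}, {"GO:A", "GO:B"}),
    ({"M1"}, {"GO:A"}),
    ({"M2"}, {"GO:B"}),
    ({"M2"}, {"GO:B", "GO:C"}),
]


def rank_terms_by_popularity(proteins, motif):
    """Rank terms by how many proteins matching `motif` carry them
    (an assumed reading of the 'popularity method')."""
    counts = Counter()
    for motifs, terms in proteins:
        if motif in motifs:
            counts.update(terms)
    return sorted(counts, key=lambda t: (-counts[t], t))


def rank_terms_by_mutual_information(proteins, motif):
    """Rank terms by the mutual information (in bits) between the binary
    events 'protein matches motif' and 'protein is annotated with term'.
    Note that mutual information also rewards strong negative association."""
    n = len(proteins)
    has_motif = [motif in motifs for motifs, _ in proteins]
    all_terms = set().union(*(terms for _, terms in proteins))
    scores = {}
    for term in all_terms:
        has_term = [term in terms for _, terms in proteins]
        joint = Counter(zip(has_motif, has_term))
        p_motif = Counter(has_motif)
        p_term = Counter(has_term)
        mi = 0.0
        for (m, t), count in joint.items():
            p_mt = count / n
            mi += p_mt * math.log2(p_mt / ((p_motif[m] / n) * (p_term[t] / n)))
        scores[term] = mi
    return sorted(scores, key=lambda t: (-scores[t], t))


def mean_reciprocal_rank(rankings, gold):
    """Average of 1/rank of the first correct term per motif (0 if none)."""
    total = 0.0
    for motif, ranked in rankings.items():
        for position, term in enumerate(ranked, start=1):
            if term in gold[motif]:
                total += 1.0 / position
                break
    return total / len(rankings)


if __name__ == "__main__":
    gold = {"M1": {"GO:A"}, "M2": {"GO:C"}}  # hypothetical reference annotations
    rankings = {m: rank_terms_by_popularity(PROTEINS, m) for m in gold}
    print(rankings)
    print(mean_reciprocal_rank(rankings, gold))
```

In this toy setting, a motif whose correct term is ranked first contributes 1 to the average and one whose correct term is ranked second contributes 0.5, which is why the measure rewards methods that place the right term near the top of the list.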