Astikainen Katja, Holm Liisa, Pitkänen Esa, Szedmak Sandor, Rousu Juho
Department of Computer Science, PO Box 68, FI-00014 University of Helsinki, Finland.
BMC Proc. 2008 Dec 17;2 Suppl 4(Suppl 4):S2. doi: 10.1186/1753-6561-2-s4-s2.
In this paper we describe work in progress on developing kernel methods for enzyme function prediction. Our focus is on developing so-called structured output prediction methods, where the enzymatic reaction is the combinatorial target object for prediction. We compare two structured output prediction methods, the Hierarchical Max-Margin Markov algorithm (HM3) and the Maximum Margin Regression algorithm (MMR), in hierarchical classification of enzyme function. As sequence features we use various string kernels and the GTG feature set derived from the global alignment trace graph of protein sequences.
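As a rough illustration of what a string kernel over protein sequences computes, the sketch below implements a k-mer spectrum kernel. The abstract does not say which string kernels were used, so the choice of the spectrum kernel, the function names, and the default k = 3 are assumptions made only for this example.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every contiguous substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(s, t, k=3):
    """Inner product of the k-mer count vectors of sequences s and t.

    Illustrative only: one common string kernel, not necessarily
    one of the kernels used in the paper.
    """
    counts_s = kmer_counts(s, k)
    counts_t = kmer_counts(t, k)
    # Only k-mers present in both sequences contribute to the sum.
    return sum(c * counts_t[kmer] for kmer, c in counts_s.items())
```

The kernel value grows with the number of shared k-mers, so sequences with similar local composition score higher than unrelated ones.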
In our experiments on predicting enzyme EC classification, we obtain over 85% accuracy (predicting the four-digit EC code) and over 91% microlabel F1 score (predicting individual EC digits). In predicting the Gold Standard enzyme families, we obtain over 79% accuracy (predicting the family correctly) and over 89% microlabel F1 score (predicting superfamilies and families). In the latter case, structured output methods are significantly more accurate than a nearest neighbor classifier. A polynomial kernel over the GTG feature set turned out to be a prerequisite for accurate function prediction. Combining GTG with string kernels boosted accuracy slightly in the case of EC class prediction.
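The difference between the two reported measures can be sketched as follows. The code assumes that each prefix of an EC code (one node of the EC hierarchy) counts as a microlabel and that F1 is micro-averaged over all such nodes; this is one plausible reading of "microlabel F1", not a definition taken from the abstract. The function names and example inputs are hypothetical.

```python
def prefixes(ec_code):
    """Return all hierarchy nodes on the path to ec_code, e.g. '1', '1.1', ..."""
    parts = ec_code.split(".")
    return {".".join(parts[:i]) for i in range(1, len(parts) + 1)}


def evaluate(true_codes, pred_codes):
    """Return (exact-match accuracy, micro-averaged F1 over hierarchy nodes)."""
    exact = tp = fp = fn = 0
    for t, p in zip(true_codes, pred_codes):
        exact += (t == p)          # full four-digit code must match
        ts, ps = prefixes(t), prefixes(p)
        tp += len(ts & ps)         # nodes predicted and correct
        fp += len(ps - ts)         # nodes predicted but wrong
        fn += len(ts - ps)         # true nodes that were missed
    accuracy = exact / len(true_codes)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1
```

For instance, a prediction of 2.7.1.2 for a true code of 2.7.1.1 counts as an error for accuracy but still earns credit for three of the four digits under the microlabel measure, which is why the F1 figures are higher than the accuracy figures.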
Structured output prediction with GTG features is shown to be computationally feasible and to have accuracy on par with state-of-the-art approaches in enzyme function prediction.