Knowledge Management in Bioinformatics, Computer Science Department, Humboldt-Universität zu Berlin, 10099 Berlin, Germany.
BMC Bioinformatics. 2013 Jan 16;14:12. doi: 10.1186/1471-2105-14-12.
Kernel-based classification is the current state of the art for extracting pairs of interacting proteins (PPIs) from free text. Various methods have been proposed, differing mainly in the specific kernel function, the type of input representation, and the feature sets. These methods are regularly compared with each other on their overall performance on different gold standard corpora, but little is known about their respective performance at the instance level.
We report a detailed analysis of the shared characteristics and the differences between 13 current methods using five PPI corpora. We identified a large number of rather difficult (misclassified by most methods) and easy (correctly classified by most methods) PPIs. We show that kernels using the same input representation perform similarly on these pairs and that building ensembles from dissimilar kernels leads to a significant performance gain. However, our analysis also reveals that difficult pairs share few characteristics, which lowers the hope that new methods, if built along the same lines as current ones, will deliver breakthroughs in extraction performance.
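The instance-level view described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `difficulty` and `majority_vote`, the 0.75 agreement threshold, and the plain majority-vote combination are all assumptions chosen for illustration, since the abstract specifies neither the cut-off for "most methods" nor how the ensembles were built.

```python
def difficulty(gold, predictions, threshold=0.75):
    """Label each candidate pair as 'difficult', 'easy', or 'neutral'.

    gold        -- list of bool, the gold-standard label per pair
    predictions -- dict mapping a method name to its list of bool predictions
    threshold   -- hypothetical fraction of methods that counts as "most"
    """
    n_methods = len(predictions)
    labels = []
    for i, truth in enumerate(gold):
        # Count how many methods disagree with the gold label for pair i.
        wrong = sum(1 for preds in predictions.values() if preds[i] != truth)
        frac_wrong = wrong / n_methods
        if frac_wrong >= threshold:
            labels.append("difficult")
        elif frac_wrong <= 1 - threshold:
            labels.append("easy")
        else:
            labels.append("neutral")
    return labels


def majority_vote(predictions):
    """Combine per-method predictions by simple majority (ties count as negative)."""
    methods = list(predictions.values())
    n_pairs = len(methods[0])
    return [sum(m[i] for m in methods) * 2 > len(methods) for i in range(n_pairs)]


# Toy data: four candidate pairs, four hypothetical methods.
gold = [True, False, True, True]
predictions = {
    "a": [True, False, False, True],
    "b": [True, True, False, False],
    "c": [True, False, False, True],
    "d": [True, False, False, False],
}
print(difficulty(gold, predictions))
print(majority_vote(predictions))
```

In this toy data, the third pair is missed by every method, so voting cannot recover it. An ensemble only helps where its members make different errors, which is why combining dissimilar kernels is the interesting case.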
Our experiments show that current methods do not capture the shared characteristics of positive PPI pairs very well, which must also be attributed to the heterogeneity of the (still very few) available corpora. Our analysis suggests that performance improvements should be sought in novel feature sets rather than in novel kernel functions.