Abi-Haidar Alaa, Kaur Jasleen, Maguitman Ana, Radivojac Predrag, Rechtsteiner Andreas, Verspoor Karin, Wang Zhiping, Rocha Luis M
School of Informatics, Indiana University, Bloomington, IN 47405, USA.
Genome Biol. 2008;9 Suppl 2(Suppl 2):S11. doi: 10.1186/gb-2008-9-s2-s11. Epub 2008 Sep 1.
We participated in three of the protein-protein interaction subtasks of the Second BioCreative Challenge: classification of abstracts relevant for protein-protein interaction (interaction article subtask [IAS]), discovery of protein pairs (interaction pair subtask [IPS]), and identification of text passages characterizing protein interaction (interaction sentences subtask [ISS]) in full-text documents. We approached the abstract classification task with a novel, lightweight linear model inspired by spam detection techniques, as well as an uncertainty-based integration scheme. We also used a support vector machine and singular value decomposition on the same features for comparison purposes. Our approach to the full-text subtasks (protein pair and passage identification) includes a feature expansion method based on word proximity networks.
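As a rough illustration of what a lightweight, spam-filter-style linear classifier over abstracts can look like, here is a minimal sketch. It is not the model described in the paper: the scoring rule (difference in per-class document frequency), the `top_k` feature cut-off, the `threshold`, and the function names `train_word_scores` and `classify` are all assumptions made for this example.

```python
from collections import Counter


def train_word_scores(relevant_docs, irrelevant_docs, top_k=50):
    """Score words by the difference in document frequency between classes.

    Assumed scoring rule, for illustration only.
    """
    def doc_freq(docs):
        counts = Counter()
        for doc in docs:
            # Count each word once per document.
            counts.update(set(doc.lower().split()))
        return {w: c / len(docs) for w, c in counts.items()}

    p_rel = doc_freq(relevant_docs)
    p_irr = doc_freq(irrelevant_docs)
    vocab = set(p_rel) | set(p_irr)
    scores = {w: p_rel.get(w, 0.0) - p_irr.get(w, 0.0) for w in vocab}
    # Keep only the most discriminative words (largest absolute score),
    # breaking ties alphabetically so the result is deterministic.
    top = sorted(scores, key=lambda w: (-abs(scores[w]), w))[:top_k]
    return {w: scores[w] for w in top}


def classify(doc, word_scores, threshold=0.0):
    """Linear decision: sum the scores of the distinct words in the document."""
    total = sum(word_scores.get(w, 0.0) for w in set(doc.lower().split()))
    return total > threshold, total


if __name__ == "__main__":
    relevant = ["protein a binds protein b", "kinase interacts with receptor"]
    irrelevant = ["gene expression in mouse liver", "protein structure of the enzyme"]
    scores = train_word_scores(relevant, irrelevant)
    print(classify("receptor binds kinase", scores))
    print(classify("mouse liver gene", scores))
```

The point of the sketch is that the decision is a single weighted sum over a small vocabulary, which is what makes such a model easy to port to similar text classification problems.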
Our approach to the abstract classification task (IAS) was among the top submissions for this task on the performance measures used in the challenge evaluation (accuracy, F-score, and area under the receiver operating characteristic curve). We also report on a web tool that we produced using our approach: the Protein Interaction Abstract Relevance Evaluator (PIARE). Our approach to the full-text tasks achieved one of the highest recall rates and one of the highest mean reciprocal ranks of correct passages.
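For readers unfamiliar with mean reciprocal rank, the following sketch shows the standard definition: for each query, take the reciprocal of the rank of the first correct item, then average over queries. The exact variant used in the challenge evaluation may differ; the function name and the passage identifiers below are hypothetical.

```python
def mean_reciprocal_rank(ranked_lists, gold_sets):
    """Average of 1/rank of the first correct item per query (0 if none is found)."""
    if not ranked_lists:
        return 0.0
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_sets):
        for rank, item in enumerate(ranked, start=1):
            if item in gold:
                total += 1.0 / rank
                break  # only the first correct item counts
    return total / len(ranked_lists)


if __name__ == "__main__":
    # Hypothetical ranked passages per query and their correct passages.
    ranked = [["p1", "p2", "p3"], ["p4", "p5"], ["p6"]]
    gold = [{"p2"}, {"p4"}, {"p9"}]
    print(mean_reciprocal_rank(ranked, gold))
```

In this toy input the first query finds a correct passage at rank 2, the second at rank 1, and the third finds none, so the contributions are 1/2, 1, and 0.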
Our approach to abstract classification shows that a simple linear model, using relatively few features, can generalize and uncover the conceptual nature of protein-protein interactions from the bibliome. Because the novel approach is based on a rather lightweight linear model, it can easily be ported and applied to similar problems. In full-text problems, the expansion of word features with word proximity networks is shown to be useful, although the need for some improvements is discussed.
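To make the idea of expanding word features with a word proximity network concrete, here is a minimal sketch. It assumes, purely for illustration, that proximity between two words is the Jaccard overlap of the sets of sentences in which they occur, and that a seed feature is expanded with every neighbor above a cut-off. The weighting, the `min_weight` parameter, and the function names are assumptions, not the paper's definitions.

```python
from collections import defaultdict
from itertools import combinations


def build_proximity_network(sentences):
    """Build a weighted word graph from sentence-level co-occurrence.

    Edge weight (assumed): |sentences with both words| / |sentences with either|.
    """
    occurs = defaultdict(set)  # word -> indices of sentences containing it
    for i, sentence in enumerate(sentences):
        for word in set(sentence.lower().split()):
            occurs[word].add(i)

    network = defaultdict(dict)
    for w1, w2 in combinations(sorted(occurs), 2):
        both = len(occurs[w1] & occurs[w2])
        if both:
            weight = both / len(occurs[w1] | occurs[w2])
            network[w1][w2] = weight
            network[w2][w1] = weight
    return network


def expand_features(seed_words, network, min_weight=0.5):
    """Add every neighbor of a seed word whose edge weight is at least min_weight."""
    expanded = set(seed_words)
    for word in seed_words:
        for neighbor, weight in network.get(word, {}).items():
            if weight >= min_weight:
                expanded.add(neighbor)
    return expanded


if __name__ == "__main__":
    sentences = ["a binds b", "a binds c", "d inhibits e"]
    net = build_proximity_network(sentences)
    print(sorted(expand_features({"binds"}, net, min_weight=0.6)))
```

Raising `min_weight` keeps the expansion conservative, while lowering it pulls in more loosely associated words, which can raise recall at the cost of precision.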