Aydın Ferhat, Hüsünbeyi Zehra Melce, Özgür Arzucan
Department of Computer Engineering, Boğaziçi University, TR-34342 Bebek, Istanbul, Turkey.
Database (Oxford). 2017 Jan 10;2017. doi: 10.1093/database/baw166. Print 2017.
Information on physical interactions among proteins is crucial because protein-protein interactions (PPIs) are central to many biological processes. The experimental techniques used to verify PPIs are vital for characterizing the identified PPIs and assessing their reliability. Much of the information about PPIs and the experimental methods used to detect them is available only in the text of the scientific publications that report them. In this study, we treat the identification of passages that describe experimental methods for physical interactions between proteins as an information retrieval task. The baseline system is based on query matching, where the queries are generated from the names (including synonyms) of the experimental methods in the Proteomics Standards Initiative-Molecular Interactions (PSI-MI) ontology. We propose two methods that expand the baseline queries with additional relevant terms. The first is a supervised approach, in which the most salient terms for each experimental method are obtained by applying the term frequency-relevance frequency (tf.rf) metric to 13 articles from our manually annotated data set of 30 full-text articles, which is publicly available. The second is an unsupervised approach, in which the queries for each experimental method are expanded using word embeddings of the names of the experimental methods in the PSI-MI ontology. The word embeddings are learned from a large unlabeled full-text corpus. The proposed methods are evaluated on a test set of 17 articles. Both methods obtain higher recall than the baseline, at the cost of lower precision. In addition to higher recall, the word embeddings based approach achieves a higher F-measure than both the baseline and the tf.rf based method.
We also show that incorporating gene name and interaction keyword identification improves precision and F-measure for all three evaluated methods. The tf.rf based approach was developed as part of our participation in the Collaborative Biocurator Assistant Task of the BioCreative V challenge assessment, whereas the word embeddings based approach is a novel contribution of this article. Database URL: https://github.com/ferhtaydn/biocemid/.
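As an illustration of the supervised query expansion step, the following is a minimal sketch of tf.rf term scoring. It assumes the commonly cited definition, tf.rf = tf × log2(2 + a / max(1, c)), where a and c are the numbers of positive and negative passages containing the term; the abstract does not state the exact variant the authors used. The function names, the tokenized toy passages, and the cut-off k are hypothetical and are not taken from the biocemid repository.

```python
import math

def tf_rf(term, positive_docs, negative_docs):
    """Score a term for one experimental method (assumed tf.rf variant)."""
    # tf: total occurrences of the term in passages annotated with the method
    tf = sum(doc.count(term) for doc in positive_docs)
    # a: positive passages containing the term; c: negative passages containing it
    a = sum(1 for doc in positive_docs if term in doc)
    c = sum(1 for doc in negative_docs if term in doc)
    rf = math.log2(2 + a / max(1, c))
    return tf * rf

def top_terms(positive_docs, negative_docs, k=5):
    """Return the k highest-scoring terms as candidate query expansions."""
    vocab = {t for doc in positive_docs for t in doc}
    scored = {t: tf_rf(t, positive_docs, negative_docs) for t in vocab}
    # Sort by descending score, breaking ties alphabetically for determinism
    return sorted(scored, key=lambda t: (-scored[t], t))[:k]

# Hypothetical toy passages, already tokenized; not real annotated data
positive = [["beads", "antibody", "lysate", "cells"],
            ["antibody", "beads", "cells"]]
negative = [["cells", "yeast", "bait"],
            ["cells", "growth"]]

expansion = top_terms(positive, negative, k=2)
```

In this toy setting, terms that occur only in the positive passages ("antibody", "beads") outrank "cells", which appears in both classes. That is the behaviour the relevance frequency factor is designed to produce.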