Uncovering protein interaction in abstracts and text using a novel linear model and word proximity networks.

Author information

Abi-Haidar Alaa, Kaur Jasleen, Maguitman Ana, Radivojac Predrag, Rechtsteiner Andreas, Verspoor Karin, Wang Zhiping, Rocha Luis M

Affiliation

School of Informatics, Indiana University, Bloomington, IN 47405, USA.

Publication information

Genome Biol. 2008;9 Suppl 2(Suppl 2):S11. doi: 10.1186/gb-2008-9-s2-s11. Epub 2008 Sep 1.

DOI:10.1186/gb-2008-9-s2-s11

PMID:18834489

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2559982/

Abstract

BACKGROUND

We participated in three of the protein-protein interaction subtasks of the Second BioCreative Challenge: classification of abstracts relevant for protein-protein interaction (interaction article subtask [IAS]), discovery of protein pairs (interaction pair subtask [IPS]), and identification of text passages characterizing protein interaction (interaction sentences subtask [ISS]) in full-text documents. We approached the abstract classification task with a novel, lightweight linear model inspired by spam detection techniques, as well as an uncertainty-based integration scheme. We also used a support vector machine and singular value decomposition on the same features for comparison purposes. Our approach to the full-text subtasks (protein pair and passage identification) includes a feature expansion method based on word proximity networks.

RESULTS

Our approach to the abstract classification task (IAS) was among the top submissions for this task in terms of measures of performance used in the challenge evaluation (accuracy, F-score, and area under the receiver operating characteristic curve). We also report on a web tool that we produced using our approach: the Protein Interaction Abstract Relevance Evaluator (PIARE). Our approach to the full-text tasks resulted in one of the highest recall rates as well as mean reciprocal rank of correct passages.

CONCLUSION

Our approach to abstract classification shows that a simple linear model, using relatively few features, can generalize and uncover the conceptual nature of protein-protein interactions from the bibliome. Because the novel approach is based on a rather lightweight linear model, it can easily be ported and applied to similar problems. In full-text problems, the expansion of word features with word proximity networks is shown to be useful, although the need for some improvements is discussed.
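The abstract does not say how the word proximity networks are built or how they expand word features, so the following is only a minimal sketch under stated assumptions. It assumes that proximity between two words is their Jaccard co-occurrence across documents (documents containing both words divided by documents containing either), and that expansion adds every word whose proximity to a seed word meets a threshold. The function names `proximity_network` and `expand_features`, the whitespace tokenizer, and the default threshold are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def proximity_network(documents):
    """Build a word co-occurrence network with Jaccard weights.

    Assumption: proximity is measured at the document level. The paper
    may use a different text unit or a different weighting.
    Returns a dict mapping an alphabetically sorted word pair to a weight.
    """
    doc_sets = [set(doc.lower().split()) for doc in documents]
    occurrences = defaultdict(int)     # documents containing a word
    co_occurrences = defaultdict(int)  # documents containing both words
    for words in doc_sets:
        for word in words:
            occurrences[word] += 1
        for a, b in combinations(sorted(words), 2):
            co_occurrences[(a, b)] += 1

    network = {}
    for (a, b), both in co_occurrences.items():
        either = occurrences[a] + occurrences[b] - both
        network[(a, b)] = both / either
    return network


def expand_features(seed_words, network, threshold=0.5):
    """Add every word linked to a seed word with weight >= threshold."""
    expanded = set(seed_words)
    for (a, b), weight in network.items():
        if weight >= threshold:
            if a in seed_words:
                expanded.add(b)
            if b in seed_words:
                expanded.add(a)
    return expanded


if __name__ == "__main__":
    docs = [
        "protein binds kinase",
        "protein binds receptor",
        "kinase phosphorylates substrate",
    ]
    net = proximity_network(docs)
    print(sorted(expand_features({"binds"}, net, threshold=0.5)))
```

In this toy corpus, "binds" and "protein" always appear together, so their weight is 1.0, while "binds" and "kinase" share only one of three documents, so "kinase" falls below the 0.5 threshold and is not added to the expanded feature set.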
