Extraction of protein interaction information from unstructured text using a context-free grammar.

Authors

Temkin Joshua M, Gilder Mark R

Affiliation

GE Global Research, 1 Research Circle, Niskayuna, NY 12309, USA.

Publication details

Bioinformatics. 2003 Nov 1;19(16):2046-53. doi: 10.1093/bioinformatics/btg279.

DOI:10.1093/bioinformatics/btg279

PMID:14594709

Abstract

MOTIVATION

As research into disease pathology and cellular function continues to generate vast amounts of data on protein, gene and small molecule (PGSM) interactions, there is a critical need to capture these results in structured formats that allow computational analysis. Although many efforts have been made to create databases that store this information in computer-readable form, populating them still largely requires manually interpreting and extracting interaction relationships from the biological research literature. Efficiently and accurately automating the extraction of interactions from unstructured text would greatly improve the content of these databases and provide a way to manage the continued growth of newly published literature.

RESULTS

In this paper, we describe a system for extracting PGSM interactions from unstructured text. Using a lexical analyzer and a context-free grammar (CFG), we demonstrate that efficient parsers can be constructed for extracting these relationships from natural language with high rates of recall and precision. The technique achieved a recall of 83.5% and a precision of 93.1% for recognizing PGSM names, and a recall of 63.9% and a precision of 70.2% for extracting interactions between these entities. In contrast to other published techniques, the use of a CFG significantly reduces the complexity of natural language processing by focusing on domain-specific structure rather than analyzing the semantics of a given language. Additionally, our approach provides a level of abstraction for adding new rules to extract other types of biological relationships beyond PGSM interactions.
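The abstract does not give the authors' lexicon or grammar rules, so the following is only a minimal sketch of the general idea: a lexical analyzer tags words, and a small context-free grammar is applied to the tagged tokens by a recursive-descent parser. The entity and verb sets, the two grammar rules, and all function names are hypothetical and chosen for illustration; they are not the published system.

```python
import re

# Hypothetical lexicons for illustration only; the paper's actual
# dictionaries and grammar rules are not given in the abstract.
ENTITIES = {"raf1", "mek1", "p53", "mdm2"}
VERBS = {"binds", "activates", "inhibits", "phosphorylates"}


def tokenize(sentence):
    """Lexical analysis: tag each word as ENTITY, VERB, AND, or OTHER."""
    tokens = []
    for word in re.findall(r"[A-Za-z0-9\-]+", sentence):
        lower = word.lower()
        if lower in ENTITIES:
            tokens.append(("ENTITY", word))
        elif lower in VERBS:
            tokens.append(("VERB", lower))
        elif lower == "and":
            tokens.append(("AND", lower))
        else:
            tokens.append(("OTHER", word))
    return tokens


def parse_entity_list(tokens, pos):
    """Grammar rule: entity_list -> ENTITY (AND ENTITY)*"""
    if pos >= len(tokens) or tokens[pos][0] != "ENTITY":
        return None, pos
    names = [tokens[pos][1]]
    pos += 1
    while (pos + 1 < len(tokens)
           and tokens[pos][0] == "AND"
           and tokens[pos + 1][0] == "ENTITY"):
        names.append(tokens[pos + 1][1])
        pos += 2
    return names, pos


def parse_interaction(tokens, pos):
    """Grammar rule: interaction -> entity_list VERB entity_list"""
    agents, pos = parse_entity_list(tokens, pos)
    if agents is None or pos >= len(tokens) or tokens[pos][0] != "VERB":
        return None
    verb = tokens[pos][1]
    targets, _ = parse_entity_list(tokens, pos + 1)
    if targets is None:
        return None
    return [(agent, verb, target) for agent in agents for target in targets]


def extract_interactions(sentence):
    """Drop OTHER tokens, then try the interaction rule at each position."""
    tokens = [t for t in tokenize(sentence) if t[0] != "OTHER"]
    for start in range(len(tokens)):
        result = parse_interaction(tokens, start)
        if result:
            return result
    return []


print(extract_interactions("Our data show that Raf1 phosphorylates MEK1 in vitro."))
```

Discarding every unrecognized word is a deliberate simplification in this sketch: it keeps the grammar tiny, but it also ignores negation and other context, which a practical rule set would need to handle.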

AVAILABILITY

The program and corpus are available by request from the authors.

Similar articles

Discovering patterns to extract protein-protein interactions from full texts.

Bioinformatics. 2004 Dec 12;20(18):3604-12. doi: 10.1093/bioinformatics/bth451. Epub 2004 Jul 29.

Information extraction from biomedical text.

J Biomed Inform. 2002 Aug;35(4):260-4. doi: 10.1016/s1532-0464(03)00015-7.

Extracting human protein interactions from MEDLINE using a full-sentence parser.

Bioinformatics. 2004 Mar 22;20(5):604-11. doi: 10.1093/bioinformatics/btg452. Epub 2004 Jan 22.

Automatic extraction of gene/protein biological functions from biomedical text.

Bioinformatics. 2005 Apr 1;21(7):1227-36. doi: 10.1093/bioinformatics/bti084. Epub 2004 Oct 27.

Finding the evidence for protein-protein interactions from PubMed abstracts.

Bioinformatics. 2006 Jul 15;22(14):e220-6. doi: 10.1093/bioinformatics/btl203.

Literature mining and database annotation of protein phosphorylation using a rule-based system.

Bioinformatics. 2005 Jun 1;21(11):2759-65. doi: 10.1093/bioinformatics/bti390. Epub 2005 Apr 6.

Annotating proteins by mining protein interaction networks.

Bioinformatics. 2006 Jul 15;22(14):e260-70. doi: 10.1093/bioinformatics/btl221.

Discovering patterns to extract protein-protein interactions from the literature: Part II.

Bioinformatics. 2005 Aug 1;21(15):3294-300. doi: 10.1093/bioinformatics/bti493. Epub 2005 May 12.

RelEx--relation extraction using dependency parse trees.

Bioinformatics. 2007 Feb 1;23(3):365-71. doi: 10.1093/bioinformatics/btl616. Epub 2006 Dec 1.

Cited by

Identification of all-against-all protein-protein interactions based on deep hash learning.

BMC Bioinformatics. 2022 Jul 8;23(1):266. doi: 10.1186/s12859-022-04811-x.

A pre-training and self-training approach for biomedical named entity recognition.

PLoS One. 2021 Feb 9;16(2):e0246310. doi: 10.1371/journal.pone.0246310. eCollection 2021.

Automatic extraction of protein-protein interactions using grammatical relationship graph.

BMC Med Inform Decis Mak. 2018 Jul 23;18(Suppl 2):42. doi: 10.1186/s12911-018-0628-4.

Natural language processing in text mining for structural modeling of protein complexes.

BMC Bioinformatics. 2018 Mar 5;19(1):84. doi: 10.1186/s12859-018-2079-4.

Identifying genotype-phenotype relationships in biomedical text.

J Biomed Semantics. 2017 Dec 6;8(1):57. doi: 10.1186/s13326-017-0163-8.

The Interaction Network Ontology-supported modeling and mining of complex interactions represented with multiple keywords in biomedical literature.

BioData Min. 2016 Dec 19;9:41. doi: 10.1186/s13040-016-0118-0. eCollection 2016.

Identifying Liver Cancer and Its Relations with Diseases, Drugs, and Genes: A Literature-Based Approach.

PLoS One. 2016 May 19;11(5):e0156091. doi: 10.1371/journal.pone.0156091. eCollection 2016.

Text Mining for Protein Docking.

PLoS Comput Biol. 2015 Dec 9;11(12):e1004630. doi: 10.1371/journal.pcbi.1004630. eCollection 2015 Dec.

Survey of Natural Language Processing Techniques in Bioinformatics.

Comput Math Methods Med. 2015;2015:674296. doi: 10.1155/2015/674296. Epub 2015 Oct 7.

Sequential pattern mining for discovering gene interactions and their contextual information from biomedical texts.

J Biomed Semantics. 2015 May 18;6:27. doi: 10.1186/s13326-015-0023-3. eCollection 2015.
