

Extraction of protein interaction information from unstructured text using a context-free grammar.

Author information

Temkin Joshua M, Gilder Mark R

Affiliation

GE Global Research, 1 Research Circle, Niskayuna, NY 12309, USA.

Publication information

Bioinformatics. 2003 Nov 1;19(16):2046-53. doi: 10.1093/bioinformatics/btg279.

Abstract

MOTIVATION

As research into disease pathology and cellular function continues to generate vast amounts of data on protein, gene and small molecule (PGSM) interactions, there is a critical need to capture these results in structured formats that allow computational analysis. Although many efforts have been made to create databases that store this information in computer-readable form, populating these sources still largely requires manually interpreting and extracting interaction relationships from the biological research literature. Efficient and accurate automated extraction of interactions from unstructured text would greatly improve the content of these databases and provide a way to manage the continued growth of newly published literature.

RESULTS

In this paper, we describe a system for extracting PGSM interactions from unstructured text. Using a lexical analyzer and a context-free grammar (CFG), we demonstrate that efficient parsers can be constructed for extracting these relationships from natural language with high rates of recall and precision. Our results show that this technique achieved a recall rate of 83.5% and a precision rate of 93.1% for recognizing PGSM names, and a recall rate of 63.9% and a precision rate of 70.2% for extracting interactions between these entities. In contrast to other published techniques, the use of a CFG significantly reduces the complexity of natural language processing by focusing on domain-specific structure rather than analyzing the semantics of a given language. Additionally, our approach provides a level of abstraction for adding new rules to extract other types of biological relationships beyond PGSM relationships.
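The abstract does not give the system's actual lexicon or grammar rules, so the following is only a minimal sketch of the general idea: a lexical analyzer tags words, and a small recursive-descent parser applies context-free rules to pull out interaction triples. The entity and verb lists, the two grammar rules, and all function names here are hypothetical and chosen purely for illustration.

```python
import re

# Hypothetical lexicon (an assumption; the paper's dictionaries are not shown).
ENTITIES = {"raf", "mek", "erk", "p53", "mdm2"}
VERBS = {"binds", "activates", "inhibits", "phosphorylates"}


def tokenize(sentence):
    """Lexical analysis: tag each word as ENTITY, VERB, AND, or OTHER."""
    tokens = []
    for word in re.findall(r"[A-Za-z0-9\-]+", sentence):
        lower = word.lower()
        if lower in ENTITIES:
            tokens.append(("ENTITY", word))
        elif lower in VERBS:
            tokens.append(("VERB", lower))
        elif lower == "and":
            tokens.append(("AND", lower))
        else:
            tokens.append(("OTHER", word))
    return tokens


def parse_entity_list(tokens, i):
    """Rule: entity_list -> ENTITY | ENTITY AND entity_list.

    Returns (names, next_index), or None if no entity starts at i.
    """
    if i >= len(tokens) or tokens[i][0] != "ENTITY":
        return None
    names, i = [tokens[i][1]], i + 1
    if i < len(tokens) and tokens[i][0] == "AND":
        rest = parse_entity_list(tokens, i + 1)
        if rest is not None:
            return names + rest[0], rest[1]
    return names, i


def parse_interaction(tokens, i):
    """Rule: interaction -> entity_list VERB entity_list."""
    left = parse_entity_list(tokens, i)
    if left is None:
        return None
    agents, i = left
    if i >= len(tokens) or tokens[i][0] != "VERB":
        return None
    verb = tokens[i][1]
    right = parse_entity_list(tokens, i + 1)
    if right is None:
        return None
    targets, i = right
    return [(a, verb, t) for a in agents for t in targets], i


def extract_interactions(sentence):
    """Try the interaction rule at every token position; skip where it fails."""
    tokens = tokenize(sentence)
    results, i = [], 0
    while i < len(tokens):
        parsed = parse_interaction(tokens, i)
        if parsed is None:
            i += 1
        else:
            triples, i = parsed
            results.extend(triples)
    return results


print(extract_interactions("We show that Raf activates MEK and ERK in vitro."))
```

In a sketch like this, supporting another relationship type would mean adding a token category and a rule rather than changing the parsing loop, which is the kind of abstraction the abstract alludes to. A real system would need far richer name recognition and sentence patterns than this toy grammar covers.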

AVAILABILITY

The program and corpus are available on request from the authors.

