Wattarujeekrit Tuangthong, Shah Parantu K, Collier Nigel
National Institute of Informatics, National Center of Sciences, Hitotsubashi 2-1-2, Chiyoda-ku, Tokyo 101-8430, Japan.
BMC Bioinformatics. 2004 Oct 19;5:155. doi: 10.1186/1471-2105-5-155.
Information extraction (IE), a technology that aims to produce instances of structured representations from free-form text, is being adopted rapidly within the molecular biology (MB) research community as a way to keep track of the latest results reported in the literature. IE systems have traditionally used shallow syntactic patterns to match facts in sentences, but such approaches appear inadequate for achieving high accuracy in MB event extraction because of the complex sentence structure involved. A consensus is emerging in the IE community that deeper knowledge structures need to be exploited, such as the relations between a verb and its arguments captured by predicate-argument structure (PAS). PAS is of interest because these structures typically correspond to events of interest and their participating entities. For this to be realized within IE, a key knowledge component is the definition of PAS frames. PAS frames for non-technical domains such as newswire are already being constructed in several projects, such as PropBank, VerbNet, and FrameNet. Knowledge from PAS should enable more accurate applications in several areas where sentence understanding is required, such as machine translation and text summarization. In this article, we explore the need to adapt PAS for the MB domain, specify PAS frames to support IE, and outline the major issues that require consideration in their construction.
We introduce PASBio by extending a model based on PropBank to the MB domain. The hypothesis we explore is that PAS holds the key to understanding relationships that describe the roles of genes and gene products in mediating their biological functions. We chose predicates describing gene expression, molecular interactions and signal transduction events, with the aim of covering a number of research areas in MB. Analysis was performed on sentences from MEDLINE and full-text journals that contain a set of verbal predicates. The results confirm the need to analyze PAS specifically for the MB domain.
At present, PASBio contains the analyzed PAS of over 30 verbs and is publicly available on the Internet for use in advanced applications. In the future we aim to expand the knowledge base to cover more verbs and the nominal form of each predicate.
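To make the idea of a PAS frame concrete, the following is a minimal sketch of how a frame with PropBank-style numbered arguments might be represented and filled from a sentence. The class names (Role, Frame), the fill method, the role descriptions and the toy sentence are all assumptions made for illustration; they are not taken from PASBio or PropBank.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Role:
    label: str        # PropBank-style numbered argument label, e.g. "Arg0"
    description: str  # illustrative gloss (assumed, not a PASBio definition)


@dataclass
class Frame:
    predicate: str
    roles: List[Role] = field(default_factory=list)

    def fill(self, **spans: str) -> Dict[str, str]:
        """Map role labels to text spans, rejecting labels the frame lacks."""
        known = {role.label for role in self.roles}
        unknown = set(spans) - known
        if unknown:
            raise ValueError(
                f"unknown roles for {self.predicate!r}: {sorted(unknown)}"
            )
        return dict(spans)


# Hypothetical frame, written for this example only.
express = Frame(
    predicate="express",
    roles=[
        Role("Arg0", "entity in which expression occurs (assumed)"),
        Role("Arg1", "gene or gene product being expressed (assumed)"),
    ],
)

# Toy sentence: "T cells express CD4."
event = express.fill(Arg0="T cells", Arg1="CD4")
print(event)
```

The frame constrains which arguments a predicate may take, so an extraction system built on such frames can reject a role assignment that the predicate does not license rather than silently accepting it.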