The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text.

Author information

Rindflesch Thomas C, Fiszman Marcelo

Affiliation

Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, 8600 Rockville Pike, Bethesda, MD 20894, USA.

Publication information

J Biomed Inform. 2003 Dec;36(6):462-77. doi: 10.1016/j.jbi.2003.11.003.

DOI:10.1016/j.jbi.2003.11.003

PMID:14759819

Abstract

Interpretation of semantic propositions in free-text documents such as MEDLINE citations would provide valuable support for biomedical applications, and several approaches to semantic interpretation are being pursued in the biomedical informatics community. In this paper, we describe a methodology for interpreting linguistic structures that encode hypernymic propositions, in which a more specific concept is in a taxonomic relationship with a more general concept. In order to effectively process these constructions, we exploit underspecified syntactic analysis and structured domain knowledge from the Unified Medical Language System (UMLS). After introducing the syntactic processing on which our system depends, we focus on the UMLS knowledge that supports interpretation of hypernymic propositions. We first use semantic groups from the Semantic Network to ensure that the two concepts involved are compatible; hierarchical information in the Metathesaurus then determines which concept is more general and which more specific. A preliminary evaluation of a sample based on the semantic group Chemicals and Drugs provides 83% precision. An error analysis was conducted and potential solutions to the problems encountered are presented. The research discussed here serves as a paradigm for investigating the interaction between domain knowledge and linguistic structure in natural language processing, and could also make a contribution to research on automatic processing of discourse structure. Additional implications of the system we present include its integration in advanced semantic interpretation processors for biomedical text and its use for information extraction in specific domains. The approach has the potential to support a range of applications, including information retrieval and ontology engineering.
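
The two-step check described in the abstract (semantic-group compatibility first, then hierarchy direction) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the dictionaries below are tiny hand-made stand-ins for the UMLS Semantic Network and Metathesaurus, and the function names and the `"ISA"` label are assumptions chosen for illustration only.

```python
from typing import Optional, Set, Tuple

# Toy stand-ins for UMLS resources (hypothetical data, not real UMLS files).
# Semantic type -> semantic group.
SEMANTIC_GROUP = {
    "Pharmacologic Substance": "Chemicals and Drugs",
    "Antibiotic": "Chemicals and Drugs",
    "Disease or Syndrome": "Disorders",
}

# Concept -> semantic type.
CONCEPT_SEMTYPE = {
    "amoxicillin": "Antibiotic",
    "penicillins": "Antibiotic",
    "anti-bacterial agents": "Pharmacologic Substance",
    "pneumonia": "Disease or Syndrome",
}

# Concept -> direct parents, standing in for Metathesaurus hierarchy data.
PARENTS = {
    "amoxicillin": {"penicillins"},
    "penicillins": {"anti-bacterial agents"},
}


def ancestors(concept: str) -> Set[str]:
    """Collect every concept reachable by following parent links upward."""
    seen: Set[str] = set()
    stack = [concept]
    while stack:
        for parent in PARENTS.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def interpret_hypernymic(a: str, b: str) -> Optional[Tuple[str, str, str]]:
    """Return (specific, "ISA", general), or None if no proposition holds."""
    # Step 1: both concepts must belong to the same semantic group.
    group_a = SEMANTIC_GROUP.get(CONCEPT_SEMTYPE.get(a))
    group_b = SEMANTIC_GROUP.get(CONCEPT_SEMTYPE.get(b))
    if group_a is None or group_a != group_b:
        return None
    # Step 2: the hierarchy decides which concept is the more general one.
    if b in ancestors(a):
        return (a, "ISA", b)
    if a in ancestors(b):
        return (b, "ISA", a)
    return None


print(interpret_hypernymic("anti-bacterial agents", "amoxicillin"))
print(interpret_hypernymic("amoxicillin", "pneumonia"))
```

In this sketch the argument order does not matter: the first call resolves amoxicillin as the more specific concept through the toy hierarchy, and the second is rejected at the semantic-group step before the hierarchy is consulted. The syntactic analysis that proposes candidate concept pairs in the first place is outside the scope of the example.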
