Groza Tudor, Verspoor Karin
School of Information Technology and Electrical Engineering, The University of Queensland, St Lucia, Australia.
Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia; Health and Biomedical Informatics Centre, The University of Melbourne, Melbourne, Australia.
PLoS One. 2015 Mar 19;10(3):e0119091. doi: 10.1371/journal.pone.0119091. eCollection 2015.
Concept recognition (CR) is a foundational task in the biomedical domain. It supports the important process of transforming unstructured resources into structured knowledge. To date, several CR approaches have been proposed, most of which focus on a particular set of biomedical ontologies. Their underlying mechanisms vary from shallow natural language processing and dictionary lookup to specialized machine learning modules. However, no prior approach considers how the case sensitivity characteristics and the term distribution of the underlying ontology affect the CR process. This article proposes a framework that models the CR process as an information retrieval task in which both case sensitivity and the information gain associated with tokens in lexical representations (e.g., term labels, synonyms) are central components of a strategy for generating term variants. The case sensitivity of a given ontology is assessed based on the distribution of so-called case-sensitive tokens in its terms, while information gain is modelled using a combination of divergence from randomness and mutual information. An extensive evaluation has been carried out using the CRAFT corpus. Experimental results show that case sensitivity awareness leads to an increase of up to 0.07 in F1 over a case-insensitive baseline on the Protein Ontology and GO Cellular Component. Similarly, the use of information gain leads to an increase of up to 0.06 in F1 over a standard baseline on GO Biological Process and Molecular Function and on GO Cellular Component. Overall, subject to the underlying token distribution, these methods are valid complementary strategies for augmenting term label sets to improve concept recognition.
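The abstract does not spell out how a case-sensitive token is defined or how the distribution is summarized, so the following is only a minimal sketch under stated assumptions. It treats a token as case sensitive when it carries an uppercase letter that is not just an initial capital (for example "BRCA1" or "mRNA"), and it reports the share of such tokens across a set of term labels. The function names `is_case_sensitive_token` and `case_sensitivity_ratio`, the heuristic itself, and the sample labels are hypothetical and are not taken from the paper.

```python
from typing import Iterable


def is_case_sensitive_token(token: str) -> bool:
    """Assumed heuristic: uppercase letters beyond a plain initial capital."""
    if token == token.lower():
        # No uppercase letters at all, e.g. "membrane" or "p53".
        return False
    if token[0].isupper() and token[1:] == token[1:].lower():
        # Ordinary capitalised word, e.g. "Protein"; not treated as case sensitive.
        return False
    return True


def case_sensitivity_ratio(labels: Iterable[str]) -> float:
    """Fraction of whitespace-separated tokens flagged as case sensitive."""
    tokens = [tok for label in labels for tok in label.split()]
    if not tokens:
        return 0.0
    flagged = sum(is_case_sensitive_token(tok) for tok in tokens)
    return flagged / len(tokens)


# Hypothetical term labels, used only to exercise the functions.
sample_labels = ["BRCA1 protein", "mitochondrial membrane", "mRNA binding"]
ratio = case_sensitivity_ratio(sample_labels)
print(f"case-sensitive token ratio: {ratio:.2f}")
```

Under this heuristic, an ontology whose labels are dominated by gene and protein symbols would score higher than one made up mostly of lowercase descriptive phrases, which is the kind of signal a case-aware matching strategy could use to decide whether to preserve case.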