College of Medicine, University of Florida, USA.
University of Montpellier, LIRMM, CNRS, Montpellier, France.
J Biomed Inform. 2018 Aug;84:31-41. doi: 10.1016/j.jbi.2018.06.007. Epub 2018 Jun 20.
Rapid advancements in biomedical research have accelerated growth in the number of relevant electronic documents published online, ranging from scholarly articles to news, blogs, and user-generated social media content. Nevertheless, much of this information is poorly organized, making it difficult to navigate. Emerging technologies such as ontologies and knowledge bases (KBs) could help organize and track the information associated with biomedical research developments. A major challenge in the automatic construction of ontologies and KBs is the identification of words with their respective sense(s) from a free-text corpus. Word-sense induction (WSI) is the task of automatically inducing the different senses of a target word across different contexts. In the last two decades, there have been several efforts on WSI. However, few methods are effective in biomedicine and life sciences.
We developed a framework for biomedical entity sense induction that combines natural language processing with supervised and unsupervised learning methods, with promising results. It is composed of three main steps: (1) a polysemy detection method to determine if a biomedical entity has multiple possible meanings; (2) a clustering quality index-based approach to predict the number of senses for the biomedical entity; and (3) a method to induce the concept(s) (i.e., senses) of the biomedical entity in a given context.
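Step (2) can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm or the quality index, so this example assumes k-means and the silhouette coefficient as stand-ins; the function names (`kmeans`, `silhouette`, `predict_num_senses`) and the toy 2-D `contexts` vectors are hypothetical and are not taken from the paper. The idea shown is only the general one: cluster the context vectors of an entity for each candidate number of senses and keep the number that gives the best index value.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=50):
    """Plain k-means with deterministic farthest-point initialisation (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                  # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in by_cluster.values())    # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def predict_num_senses(points, k_min=2, k_max=5):
    """Return the candidate k with the highest silhouette score."""
    best_k, best_score = None, -2.0
    for k in range(k_min, min(k_max, len(points) - 1) + 1):
        score = silhouette(points, kmeans(points, k))
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Hypothetical 2-D context vectors for one ambiguous entity (three tight groups).
contexts = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11),
            (20, 0), (20, 1), (21, 0), (21, 1)]
print(predict_num_senses(contexts))
```

The default range of 2 to 5 candidate senses mirrors the range of concepts per entity reported for the evaluation dataset; a real system would derive the context vectors from text rather than from hand-written coordinates.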
To evaluate our framework, we used the well-known MSH WSD polysemic dataset, which contains 203 annotated ambiguous biomedical entities, each linked to 2-5 concepts. First, our polysemy detection method obtained an F-measure of 98%. Second, our approach for predicting the number of senses achieved an F-measure of 93%. Finally, we induced the concepts of the biomedical entities based on a clustering algorithm and then extracted the keywords of each cluster to represent the concept.
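For readers unfamiliar with the metric, the F-measure is commonly computed as the harmonic mean of precision and recall. The sketch below assumes the balanced variant (F1) and a binary polysemic/monosemic labelling; the abstract does not state which variant was used, and the `gold` and `predicted` lists are made-up illustration data, not results from the study.

```python
def f_measure(gold, predicted, positive=1):
    """Balanced F-measure (F1) for one positive class; assumes equal-length label lists."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, predicted) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, predicted) if g == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: 1 = polysemic entity, 0 = monosemic entity.
gold = [1, 1, 1, 0, 0, 1]
predicted = [1, 1, 0, 0, 1, 1]
print(f_measure(gold, predicted))  # 3 TP, 1 FP, 1 FN -> 0.75
```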
We have developed a framework for biomedical entity sense induction with promising results. Our results can benefit a number of downstream applications, for example by helping to resolve concept ambiguities when building Semantic Web KBs from biomedical text.