Genealogical-based method for multiple ontology self-extension in MeSH.

Author information

Guo Yu-Wen, Tang Yi-Tsung, Kao Hung-Yu

Publication information

IEEE Trans Nanobioscience. 2014 Jun;13(2):124-30. doi: 10.1109/TNB.2014.2320413.

DOI:10.1109/TNB.2014.2320413

Abstract

Over the last decade, ontologies used for biomedical annotation have had a deep impact on life science. MeSH is a well-known ontology used to index journal articles in PubMed, which improves literature searching on multi-domain topics. With the explosive growth of data in recent years, new terms and concepts continually emerge and displace older ones. Automatically extending sets of existing terms will enable bio-curators to systematically improve text-based ontologies level by level. However, most related techniques, which apply symbolic patterns to a literature corpus, tend to focus on the more general parts of the ontology rather than the specific ones. In this work, we therefore present a novel method that uses genealogical information from the ontology itself to find suitable siblings for ontology extension. Our approach consists of a sibling generation stage and a pruning strategy, both based on the breadth and depth dimensions of the ontology. On average, the genealogical-based method achieved a precision of 0.5, with the best performance, 0.83, in the category "Organisms." It also achieved an average precision of 0.69 on 229 new terms in the 2013 version of MeSH.
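
The breadth and depth dimensions mentioned above can be pictured with a minimal sketch, assuming the hierarchy is encoded as MeSH-style dot-separated tree numbers, where dropping the last segment gives the parent node. The abstract does not describe the sibling generation or pruning procedure in detail, so this is not the authors' algorithm. The function names and the toy codes below are hypothetical and serve only to show how sibling groups, node depth, and precision (the share of proposed terms that are accepted) could be computed.

```python
from collections import defaultdict


def parent_of(tree_number):
    """Return the parent tree number, or None for a top-level node."""
    head, sep, _ = tree_number.rpartition(".")
    return head if sep else None


def depth_of(tree_number):
    """Depth dimension: number of dot-separated levels in the tree number."""
    return tree_number.count(".") + 1


def siblings_by_parent(terms):
    """Breadth dimension: group term names under their shared parent node."""
    groups = defaultdict(list)
    for name, tree_number in terms.items():
        groups[parent_of(tree_number)].append(name)
    return {parent: sorted(names) for parent, names in groups.items()}


def precision(proposed, accepted):
    """Fraction of proposed terms that also appear in the accepted set."""
    proposed = set(proposed)
    if not proposed:
        return 0.0
    return len(proposed & set(accepted)) / len(proposed)


# Hypothetical toy hierarchy; the codes are illustrative, not real MeSH entries.
toy_terms = {
    "term_a": "X01.100",
    "term_b": "X01.200",
    "term_c": "X01.200.050",
    "term_d": "X01.200.075",
}
groups = siblings_by_parent(toy_terms)
```

Here `groups` maps each parent node to the terms directly beneath it, so a candidate new term would be compared against the members of one such group. A precision of 0.5 in this sense means that half of the proposed terms were judged correct.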
