Measures of semantic similarity and relatedness in the biomedical domain.

Author information

Pedersen Ted, Pakhomov Serguei V S, Patwardhan Siddharth, Chute Christopher G

Affiliation

Department of Computer Science, 1114 Kirby Drive, University of Minnesota, Duluth, MN 55812, USA.

Publication information

J Biomed Inform. 2007 Jun;40(3):288-99. doi: 10.1016/j.jbi.2006.06.004. Epub 2006 Jun 10.

Abstract

Measures of semantic similarity between concepts are widely used in Natural Language Processing. In this article, we show how six existing domain-independent measures can be adapted to the biomedical domain. These measures were originally based on WordNet, an English lexical database of concepts and relations. In this research, we adapt these measures to the SNOMED-CT ontology of medical concepts. The measures include two path-based measures and three measures that augment path-based measures with information content statistics from corpora. We also derive a context vector measure based on medical corpora that can be used as a measure of semantic relatedness. These six measures are evaluated against a newly created test bed of 30 medical concept pairs scored by three physicians and nine medical coders. We find that the medical coders and physicians differ in their ratings, and that the context vector measure correlates most closely with the physicians, while the path-based measures and one of the information content measures correlate most closely with the medical coders. We conclude that there is a role both for more flexible measures of relatedness based on information derived from corpora and for measures that rely on existing ontological structures.
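To make the distinction between path-based and information-content measures concrete, here is a minimal Python sketch on a toy is-a hierarchy. Everything in it is an assumption for illustration: the concept names, the `PARENT` hierarchy, and the `COUNTS` corpus frequencies are hypothetical and are not taken from SNOMED-CT or from the article's data. The path-based function uses the inverse of the number of nodes on the shortest is-a path, and the information-content function scores a pair by the information content of its lowest common subsumer (the idea behind Resnik's measure). The abstract does not specify which formulas were used, so this should not be read as the authors' implementation.

```python
import math

# Hypothetical toy is-a hierarchy (child -> parent); NOT taken from SNOMED-CT.
PARENT = {
    "disease": None,
    "heart disease": "disease",
    "lung disease": "disease",
    "myocardial infarction": "heart disease",
    "heart failure": "heart disease",
    "pneumonia": "lung disease",
}

# Hypothetical corpus counts of direct mentions of each concept.
COUNTS = {
    "disease": 10,
    "heart disease": 20,
    "lung disease": 10,
    "myocardial infarction": 30,
    "heart failure": 10,
    "pneumonia": 20,
}


def ancestors(concept):
    """Return the chain from the concept up to the root, concept first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def lowest_common_subsumer(a, b):
    """Return the most specific concept that subsumes both a and b."""
    ancestors_of_b = set(ancestors(b))
    for node in ancestors(a):
        if node in ancestors_of_b:
            return node
    raise ValueError("concepts do not share a root")


def path_similarity(a, b):
    """Inverse of the number of nodes on the shortest is-a path from a to b."""
    lcs = lowest_common_subsumer(a, b)
    nodes_on_path = ancestors(a).index(lcs) + ancestors(b).index(lcs) + 1
    return 1.0 / nodes_on_path


def information_content(concept):
    """-log p(concept), where a concept's count includes all its descendants."""
    total = sum(COUNTS.values())
    freq = sum(n for c, n in COUNTS.items() if concept in ancestors(c))
    return -math.log(freq / total)


def ic_similarity(a, b):
    """Information content of the lowest common subsumer of a and b."""
    return information_content(lowest_common_subsumer(a, b))


if __name__ == "__main__":
    pairs = [
        ("myocardial infarction", "heart failure"),
        ("myocardial infarction", "pneumonia"),
    ]
    for a, b in pairs:
        print(a, "|", b, "|", path_similarity(a, b), "|", ic_similarity(a, b))
```

In this toy setup both measures rank the two heart conditions as more similar to each other than either is to pneumonia, but they get there differently: the path-based score depends only on the shape of the hierarchy, while the information-content score also depends on how often the shared ancestor and its descendants occur in the (here invented) corpus.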
