Joubert Michel, Darmoni Stefan J, Avillach Paul, Dahamna Badisse, Fieschi Marius
LERTIM, Faculte de Medecine, Universite de la Mediterranee, Marseille, France.
Stud Health Technol Inform. 2008;136:205-10.
The aim of this study is to help indexers identify which MeSH terms, in a list of terms automatically extracted from a document, should be considered major ones.
We propose a method combining symbolic knowledge (the UMLS Metathesaurus and Semantic Network) with statistical knowledge drawn, using data mining measures, from co-occurrences of terms in the CISMeF database (a French-language quality-controlled health gateway). The method was tested on a CISMeF corpus of 293 resources.
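The statistical side of such a method can be illustrated with a minimal sketch. The abstract does not say which data mining measures were used, so the support, confidence and lift computed below are assumptions chosen as common association measures. The function name `cooccurrence_measures` and the toy records are hypothetical and are not CISMeF data.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_measures(records):
    """Compute support, confidence and lift for each co-occurring term pair.

    `records` is a list of sets of indexing terms, one set per resource.
    Pairs are keyed in alphabetical order: (a, b) with a < b.
    """
    n = len(records)
    term_counts = Counter()
    pair_counts = Counter()
    for terms in records:
        unique = sorted(set(terms))
        term_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    measures = {}
    for (a, b), n_ab in pair_counts.items():
        support = n_ab / n                      # P(a and b)
        confidence = n_ab / term_counts[a]      # P(b | a)
        lift = support / ((term_counts[a] / n) * (term_counts[b] / n))
        measures[(a, b)] = {
            "support": support,
            "confidence": confidence,
            "lift": lift,
        }
    return measures


# Toy records (hypothetical, for illustration only).
records = [
    {"Asthma", "Child", "Anti-Asthmatic Agents"},
    {"Asthma", "Anti-Asthmatic Agents"},
    {"Asthma", "Child"},
    {"Diabetes Mellitus", "Child"},
]
measures = cooccurrence_measures(records)
print(measures[("Anti-Asthmatic Agents", "Asthma")])
```

A lift above 1 suggests that two terms co-occur more often than independence would predict; a lift near or below 1 suggests an association that carries little information.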
In the processed records, the proportion of major terms was 0.37 ± 0.26. In the term lists produced by the method, the proportion of terms initially marked as major was 0.54 ± 0.31.
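As an illustration of how a figure of this kind can be obtained, the sketch below computes the proportion of major terms per record and summarizes it as a mean and a standard deviation. The abstract does not state whether the reported spread is a sample or a population standard deviation, so the use of `statistics.stdev` is an assumption. The records are hypothetical placeholders.

```python
from statistics import mean, stdev


def major_proportion(terms, major_terms):
    """Fraction of a record's terms that indexers flagged as major."""
    terms = set(terms)
    if not terms:
        raise ValueError("record has no terms")
    return len(terms & set(major_terms)) / len(terms)


# Hypothetical records: (all terms, terms flagged as major).
toy_records = [
    ({"A", "B", "C", "D"}, {"A"}),
    ({"A", "B"}, {"A"}),
    ({"A", "B", "C", "D"}, {"A", "B", "C"}),
]

proportions = [major_proportion(terms, major) for terms, major in toy_records]
avg = mean(proportions)
sd = stdev(proportions)  # sample standard deviation (assumption)
print(f"{avg:.2f} +/- {sd:.2f}")
```

Applying the same computation to the lists before and after filtering gives two comparable figures, which is the shape of the comparison reported here.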
The proposed method reduces the number of terms that seem of little use for describing the content of resources, such as "check tags", while retaining the most descriptive ones. These terms are discarded in two ways: 1) semantic knowledge is used to remove associations of concepts that bear no real medical significance; 2) statistical knowledge is used to remove associations of terms that are not statistically significant.
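The two removal steps can be sketched as a pair of filters applied to candidate term associations. This is a rough illustration under stated assumptions, not the authors' implementation: the abstract does not name the significance test, so a chi-square test on a 2x2 contingency table (critical value 3.841 for one degree of freedom at the 0.05 level) is assumed. The set of allowed semantic type pairs stands in for relations from the UMLS Semantic Network, and all counts are hypothetical.

```python
from itertools import combinations

CHI2_CRITICAL = 3.841  # chi-square, 1 degree of freedom, alpha = 0.05


def chi_square_2x2(n_ab, n_a, n_b, n):
    """Chi-square statistic for the co-occurrence of terms a and b in n records."""
    observed = [
        [n_ab, n_a - n_ab],
        [n_b - n_ab, n - n_a - n_b + n_ab],
    ]
    row_totals = [n_a, n - n_a]
    col_totals = [n_b, n - n_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected == 0:
                return 0.0
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


def select_terms(candidates, semantic_type, allowed_type_pairs,
                 term_counts, pair_counts, n_records):
    """Keep terms that take part in at least one association passing both filters."""
    kept = set()
    for a, b in combinations(sorted(candidates), 2):
        # Step 1 (semantic): drop pairs whose semantic types are not related.
        types = frozenset((semantic_type[a], semantic_type[b]))
        if types not in allowed_type_pairs:
            continue
        # Step 2 (statistical): drop pairs whose co-occurrence is not significant.
        n_ab = pair_counts.get(frozenset((a, b)), 0)
        chi2 = chi_square_2x2(n_ab, term_counts[a], term_counts[b], n_records)
        if chi2 < CHI2_CRITICAL:
            continue
        kept.update((a, b))
    return kept


# Hypothetical inputs, for illustration only.
semantic_type = {
    "Asthma": "Disease or Syndrome",
    "Anti-Asthmatic Agents": "Pharmacologic Substance",
    "Humans": "Human",
}
allowed_type_pairs = {frozenset(("Pharmacologic Substance", "Disease or Syndrome"))}
term_counts = {"Asthma": 20, "Anti-Asthmatic Agents": 20, "Humans": 90}
pair_counts = {
    frozenset(("Asthma", "Anti-Asthmatic Agents")): 15,
    frozenset(("Asthma", "Humans")): 18,
    frozenset(("Anti-Asthmatic Agents", "Humans")): 18,
}

selected = select_terms(set(semantic_type), semantic_type, allowed_type_pairs,
                        term_counts, pair_counts, n_records=100)
print(sorted(selected))
```

In this toy example the check tag "Humans" is dropped because it takes part in no association that passes both filters, while the two descriptive terms are retained. The chi-square statistic does not distinguish positive from negative association, so a fuller version would also check that the observed co-occurrence exceeds the expected one.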
This method can effectively assist indexers in their daily work and will soon be applied in the CISMeF system.