Kokkinakis Dimitrios
Centre for Language Technology, Department of Swedish Language, the Swedish Language Bank, University of Gothenburg, Gothenburg, Sweden.
Stud Health Technol Inform. 2011;169:814-8.
This paper reports the results of a large-scale mapping of SNOMED CT to scientific medical corpora. The aim is to automatically assess the validity, reliability and coverage of the Swedish SNOMED CT translation, the largest and most extensive available resource of medical terminology. The method is based on generating predominantly safe harbor term variants which, together with simple linguistic processing and the existing SNOMED term content, are mapped to large corpora. The results show that term variation is very frequent, which may have implications for technological applications that use SNOMED CT, such as indexing and information retrieval, decision support systems and text mining. Naïve approaches to terminology mapping and indexing would critically affect the performance, success and results of such applications. SNOMED CT appears poorly suited to automatically capturing the enormous variety of concepts in scientific corpora: only 6.3% of all SNOMED terms could be matched directly to the corpus. Improving on this requires generating extensive variant forms and applying fuzzy and partial matching techniques, at the risk of recognizing a large number of false positives and spurious results.
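The gap between direct matching and variant-based matching can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not list its variant rules, so the hyphen and space rules in `variants`, the helper names, and the toy English terms and corpus are all assumptions made for illustration.

```python
import re


def variants(term):
    """Return a few surface variants of a term (illustrative rules, not the paper's)."""
    base = term.lower().strip()
    forms = {base}
    forms.add(base.replace("-", " "))  # hyphen replaced by a space
    forms.add(base.replace("-", ""))   # hyphen dropped
    forms.add(base.replace(" ", "-"))  # space replaced by a hyphen
    return forms


def tokenize(text):
    """Lowercase the text and split it into word tokens, keeping internal hyphens."""
    return re.findall(r"[\w-]+", text.lower())


def contains(tokens, phrase):
    """True if the phrase occurs in tokens as a contiguous token sequence."""
    p = tokenize(phrase)
    n = len(p)
    return n > 0 and any(tokens[i:i + n] == p for i in range(len(tokens) - n + 1))


def coverage(terms, corpus):
    """Return (share of terms matched directly, share matched via any variant)."""
    tokens = tokenize(corpus)
    direct = sum(contains(tokens, t) for t in terms)
    with_variants = sum(
        any(contains(tokens, v) for v in variants(t)) for t in terms
    )
    return direct / len(terms), with_variants / len(terms)


# Hypothetical toy data, not taken from SNOMED CT or the paper's corpora.
terms = ["x-ray examination", "heart failure", "beta blocker", "liver cirrhosis"]
corpus = "The patient with heart failure received a beta-blocker after an x ray examination."
direct_share, variant_share = coverage(terms, corpus)
```

On this toy input only "heart failure" matches directly, while the hyphen and space variants also recover "x-ray examination" and "beta blocker". Looser rules would raise coverage further, but they would also start matching unrelated text, which is the false-positive risk the abstract points to.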