Instituto de Lengua, Literatura y Antropología (ILLA), CSIC (Spanish National Research Council), Albasanz 26-28, 28037, Madrid, Spain.
J Biomed Semantics. 2023 Feb 2;14(1):2. doi: 10.1186/s13326-022-00281-5.
Medical lexicons enable the natural language processing (NLP) of health texts. Lexicons gather terms and concepts from thesauri and ontologies, together with linguistic data for part-of-speech (PoS) tagging, lemmatization or natural language generation. To date, no such resource exists for Spanish.
This article describes a unified medical lexicon for medical natural language processing in Spanish. MedLexSp includes terms and inflected word forms with PoS information and Unified Medical Language System (UMLS) semantic types, groups and Concept Unique Identifiers (CUIs). To create it, we used NLP techniques and domain corpora (e.g. MedlinePlus). We also collected terms from the Dictionary of Medical Terms from the Spanish Royal Academy of Medicine, the Medical Subject Headings (MeSH), the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT), the Medical Dictionary for Regulatory Activities Terminology (MedDRA), the International Classification of Diseases version 10, the Anatomical Therapeutic Chemical Classification, the National Cancer Institute (NCI) Dictionary, the Online Mendelian Inheritance in Man (OMIM) and OrphaData. Terms related to COVID-19 were assembled by applying a similarity-based approach with word embeddings trained on a large corpus. MedLexSp includes 100 887 lemmas, 302 543 inflected forms (conjugated verbs, and number/gender variants), and 42 958 UMLS CUIs. We report two use cases of MedLexSp. First, we applied the lexicon to pre-annotate a corpus of 1200 texts related to clinical trials. Second, we used it for PoS tagging and lemmatizing texts about clinical cases. MedLexSp improved the scores for PoS tagging and lemmatization compared to the default spaCy and Stanza Python libraries.
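The similarity-based collection of COVID-19 terms can be illustrated with a minimal sketch. It assumes cosine similarity over word vectors and a fixed threshold; the abstract does not state the similarity measure, the threshold, or the seed terms, so `expand_terms`, the threshold value and the toy three-dimensional vectors below are all hypothetical and stand in for embeddings trained on a large corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_terms(seeds, embeddings, threshold=0.8):
    """Return vocabulary words whose best similarity to any seed term
    reaches the threshold, most similar first (hypothetical helper)."""
    seed_vectors = [embeddings[s] for s in seeds if s in embeddings]
    if not seed_vectors:
        return []
    scores = {}
    for word, vector in embeddings.items():
        if word in seeds:
            continue
        best = max(cosine(vector, sv) for sv in seed_vectors)
        if best >= threshold:
            scores[word] = best
    return sorted(scores, key=scores.get, reverse=True)


# Toy vectors for illustration only; real embeddings have hundreds of dimensions.
TOY_EMBEDDINGS = {
    "covid-19": [0.9, 0.1, 0.0],
    "coronavirus": [0.85, 0.15, 0.05],
    "sars-cov-2": [0.8, 0.2, 0.0],
    "fractura": [0.0, 0.1, 0.95],
}

related = expand_terms({"covid-19"}, TOY_EMBEDDINGS)
print(related)
```

In a setting like this, candidate terms retrieved in this way would normally still be reviewed before being added to a lexicon.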
The lexicon is distributed as a delimiter-separated value file; an XML file in the Lexical Markup Framework; a lemmatizer module for the spaCy and Stanza libraries; and complementary Lexical Record (LR) files. The embeddings and code to extract COVID-19 terms, and the spaCy and Stanza lemmatizers enriched with medical terms, are provided in a public repository.
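As a rough illustration of how a delimiter-separated lexicon of this kind could drive lemmatization, the sketch below loads rows into a lookup table keyed by inflected form. The column names (`form`, `lemma`, `pos`, `cui`), the pipe delimiter and the sample rows are assumptions made for the example, not the actual MedLexSp file layout, and the CUI values are placeholders rather than real identifiers.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an assumed layout: form|lemma|pos|cui
SAMPLE = """form|lemma|pos|cui
tumores|tumor|NOUN|CUI_PLACEHOLDER_1
tumor|tumor|NOUN|CUI_PLACEHOLDER_1
administró|administrar|VERB|CUI_PLACEHOLDER_2
"""


def load_lexicon(handle, delimiter="|"):
    """Index lexicon rows by lowercased inflected form."""
    index = defaultdict(list)
    for row in csv.DictReader(handle, delimiter=delimiter):
        index[row["form"].lower()].append(row)
    return dict(index)


def lemmatize(token, lexicon, pos=None):
    """Return the lemma of the first entry matching the token (and PoS tag,
    if given); fall back to the token itself when nothing matches."""
    for entry in lexicon.get(token.lower(), []):
        if pos is None or entry["pos"] == pos:
            return entry["lemma"]
    return token


lexicon = load_lexicon(io.StringIO(SAMPLE))
print(lemmatize("Tumores", lexicon))
print(lemmatize("administró", lexicon, pos="VERB"))
```

With a real file, `io.StringIO(SAMPLE)` would be replaced by an open file handle, and the delimiter and column names adjusted to match the distributed format.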