Department of Computer Science, Ben Gurion University, Beer Sheva, Israel.
Gertner Institute for Epidemiology and Health Policy Research, Tel HaShomer, Israel.
J Am Med Inform Assoc. 2020 Oct 1;27(10):1585-1592. doi: 10.1093/jamia/ocaa150.
In Hebrew online health communities, participants commonly write medical terms as transliterations of an English source term. Such transliterations introduce high variability in text and challenge text-analytics methods. To reduce this variability, medical terms must be normalized, for example by linking them to Unified Medical Language System (UMLS) concepts. We present a method to identify both transliterated and translated Hebrew medical terms and link them with UMLS entities.
We investigate the effect of linking terms in Camoni, a popular Israeli online health community in Hebrew. Our method, MDTEL (Medical Deep Transliteration Entity Linking), includes (1) an attention-based recurrent neural network encoder-decoder that transliterates words and maps UMLS terms from English to Hebrew, (2) an unsupervised method for creating a transliteration dataset in any language without manually labeled data, and (3) an efficient way to identify medical entities in the Hebrew corpus and link them to UMLS concepts, by first producing a high-recall list of candidate medical terms in the corpus and then filtering the candidates down to relevant medical terms.
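The two-stage linking in step (3) can be illustrated with a minimal sketch. It is not the MDTEL implementation: plain string similarity from Python's `difflib` stands in for the learned components, and the lexicon, the placeholder concept identifiers (`CUI_INSULIN`, `CUI_METFORMIN`), the function names, and both thresholds are assumptions made for illustration only.

```python
from difflib import SequenceMatcher

# Hypothetical toy lexicon: Hebrew surface form -> placeholder concept ID.
# These IDs are illustrative stand-ins, not real UMLS CUIs.
LEXICON = {
    "אינסולין": "CUI_INSULIN",      # "insulin", transliterated
    "מטפורמין": "CUI_METFORMIN",    # "metformin", transliterated
}


def similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for a learned matcher."""
    return SequenceMatcher(None, a, b).ratio()


def generate_candidates(tokens, lexicon, low_threshold=0.6):
    """High-recall stage: keep every (token, term) pair above a permissive threshold."""
    candidates = []
    for token in tokens:
        for term, cui in lexicon.items():
            score = similarity(token, term)
            if score >= low_threshold:
                candidates.append((token, term, cui, score))
    return candidates


def filter_candidates(candidates, high_threshold=0.85):
    """Filtering stage: keep only the best match per token, if it is strong enough."""
    best = {}
    for token, _term, cui, score in candidates:
        if score >= high_threshold and score > best.get(token, (None, 0.0))[1]:
            best[token] = (cui, score)
    return {token: cui for token, (cui, _score) in best.items()}


# A correctly spelled term, a misspelled variant, and a non-medical word ("hello").
tokens = ["אינסולין", "אינסולן", "שלום"]
links = filter_candidates(generate_candidates(tokens, LEXICON))
```

In this sketch the permissive first threshold trades precision for recall, and the stricter second threshold recovers precision; a real system would presumably replace both with models trained or tuned on annotated data.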
We carry out experiments on 3 disease-specific communities: diabetes, multiple sclerosis, and depression. MDTEL's tagging and normalization of Camoni posts achieved 99% accuracy, 92% recall, and 87% precision. When terms in queries from the Camoni search logs were tagged and normalized, the UMLS-normalized queries improved search results in 46% of the cases.
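For readers comparing the three reported figures, the sketch below shows how accuracy, recall, and precision are conventionally computed from token-level counts. The counts are made up for illustration and are not taken from the study; the function name is likewise an assumption.

```python
def tagging_metrics(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp)            # share of tagged terms that are correct
    recall = tp / (tp + fn)               # share of true terms that were tagged
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all tokens labeled correctly
    return precision, recall, accuracy


# Illustrative counts only: most tokens in a post are non-medical (true negatives).
precision, recall, accuracy = tagging_metrics(tp=8, fp=2, fn=2, tn=88)
```

Because non-medical tokens dominate ordinary text, the true-negative count inflates accuracy, which is why accuracy can sit well above precision and recall.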
Cross-lingual UMLS entity linking from Hebrew is possible and improves search performance across communities. Annotated datasets, annotation guidelines, and code are made available online (https://github.com/yonatanbitton/mdtel).