Medical terminology-based computing system: a lightweight post-processing solution for out-of-vocabulary multi-word terms.

Authors

Saeed Nadia, Naveed Hammad

Affiliation

Computational Biology Research Lab, Department of Computer Science, National University of Computer and Emerging Sciences (NUCES-FAST), Islamabad, Pakistan.

Publication information

Front Mol Biosci. 2022 Aug 12;9:928530. doi: 10.3389/fmolb.2022.928530. eCollection 2022.

Abstract

The linguistic rules of medical terminology help readers become familiar with rare or complex clinical and biomedical terms. Medical language follows a Greek- and Latin-inspired nomenclature, which helps stakeholders simplify medical terms and gain semantic familiarity with them. However, natural language processing (NLP) models misrepresent rare and complex biomedical words. In this study, we present MedTCS, a lightweight post-processing module that uses medical nomenclature to simplify hybridized or compound terms into regular words. MedTCS enabled word-based embedding models to achieve 100% coverage and enabled the BioWordVec model to achieve high correlation scores (0.641 and 0.603 on the UMNSRS similarity and relatedness datasets, respectively) that significantly surpass the n-gram and sub-word approaches of FastText and BERT. In the downstream task of named entity recognition (NER), MedTCS enabled the latest clinical embedding model, FastText-OA-All-300d, to improve its F1-score from 0.45 to 0.80 on the BC5CDR corpus and from 0.59 to 0.81 on the NCBI-Disease corpus. Similarly, in the drug indication classification task, our model increased coverage by 9% and the F1-score by 1%. Our results indicate that a medical terminology-based module, applied as a post-processing step on pre-trained embeddings, provides distinctive contextual clues that enhance the vocabulary. We demonstrate that the proposed module enables word embedding models to generate vectors for out-of-vocabulary words effectively. We expect that our study can be a stepping stone for the use of biomedical knowledge-driven resources in NLP.
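To make the idea in the abstract concrete, the following is a minimal sketch of how a compound medical term might be broken into Greek/Latin word parts and given a vector from the embeddings of their plain-English glosses. It is not the authors' implementation: the lexicon `WORD_PARTS`, the toy vectors in `EMBEDDINGS`, the greedy segmentation in `split_term`, and the choice to average vectors in `embed` are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical mini-lexicon mapping Greek/Latin word parts to plain-English glosses.
WORD_PARTS: Dict[str, str] = {
    "cardio": "heart",
    "myo": "muscle",
    "pathy": "disease",
    "hepat": "liver",
    "itis": "inflammation",
}

# Toy 3-dimensional vectors standing in for a pre-trained embedding model.
EMBEDDINGS: Dict[str, List[float]] = {
    "heart": [1.0, 0.0, 0.0],
    "muscle": [0.0, 1.0, 0.0],
    "disease": [0.0, 0.0, 1.0],
    "liver": [1.0, 1.0, 0.0],
    "inflammation": [0.0, 1.0, 1.0],
}


def split_term(term: str) -> Optional[List[str]]:
    """Greedy longest-match segmentation into known word parts.

    Returns None when the term cannot be fully covered by the lexicon.
    There is no backtracking, so some valid segmentations may be missed.
    """
    term = term.lower()
    parts: List[str] = []
    i = 0
    while i < len(term):
        match = None
        # Try the longest candidate substring first.
        for j in range(len(term), i, -1):
            if term[i:j] in WORD_PARTS:
                match = term[i:j]
                break
        if match is None:
            return None
        parts.append(match)
        i += len(match)
    return parts


def embed(term: str) -> Optional[List[float]]:
    """Return a vector for the term, falling back to its word parts if it is out of vocabulary."""
    key = term.lower()
    if key in EMBEDDINGS:
        return EMBEDDINGS[key]
    parts = split_term(key)
    if parts is None:
        return None
    # Look up the gloss of each part and keep those the model knows.
    vectors = [EMBEDDINGS[WORD_PARTS[p]] for p in parts if WORD_PARTS[p] in EMBEDDINGS]
    if not vectors:
        return None
    dim = len(vectors[0])
    # Assumption: combine the gloss vectors by simple averaging.
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


if __name__ == "__main__":
    print(split_term("cardiomyopathy"))
    print(embed("hepatitis"))
```

In this sketch, "cardiomyopathy" is absent from the toy vocabulary but can still be assigned a vector through the glosses "heart", "muscle", and "disease". A term that the lexicon cannot fully segment yields no vector at all, which is the coverage gap a fuller lexicon would need to close.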

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6d5f/9411640/fd70b557ebcd/fmolb-09-928530-g001.jpg
