Center for Statistical Science, Tsinghua University, Beijing, China; Department of Industrial Engineering, Tsinghua University, Beijing, China.
Institute of Medical Information, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing, China.
J Biomed Inform. 2022 Feb;126:103983. doi: 10.1016/j.jbi.2021.103983. Epub 2022 Jan 4.
This paper proposes knowledge-aware embeddings, a critical tool for medical term normalization.
We develop CODER (Cross-lingual knowledge-infused medical term embedding) via contrastive learning over the Unified Medical Language System (UMLS), a medical knowledge graph (KG); similarities are computed from both terms and relation triplets in the KG. Training with relations injects medical knowledge into the embeddings and can improve their performance as machine learning features.
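The contrastive idea described above can be sketched with a toy example. This is a hypothetical illustration, not the authors' implementation: synonymous terms for the same UMLS concept should receive high embedding similarity, and an InfoNCE-style loss pulls a positive (synonym) toward the anchor while pushing unrelated terms away. Toy NumPy vectors stand in for encoder outputs.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss: pull the positive toward the anchor, push negatives away."""
    sims = np.array([cosine(anchor, positive)] +
                    [cosine(anchor, n) for n in negatives]) / tau
    sims -= sims.max()                        # numerical stability
    probs = np.exp(sims) / np.exp(sims).sum()
    return -np.log(probs[0])                  # the positive sits at index 0

# Toy embeddings: two synonyms of one concept and an unrelated term.
heart_attack = np.array([0.9, 0.1, 0.0])
mi           = np.array([0.85, 0.15, 0.05])   # "myocardial infarction"
fracture     = np.array([0.0, 0.2, 0.9])

loss = info_nce(heart_attack, mi, [fracture])  # small loss: synonyms already close
```

The same scoring extends to relation triplets by treating the tail concept of a true (head, relation, tail) triplet as the positive and corrupted tails as negatives.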
We evaluate CODER based on zero-shot term normalization, semantic similarity, and relation classification benchmarks, and the results show that CODER outperforms various state-of-the-art biomedical word embeddings, concept embeddings, and contextual embeddings.
CODER embeddings accurately capture the semantic similarity and relatedness of medical concepts. CODER can be used for embedding-based medical term normalization or to provide features for machine learning. Like other pretrained language models, CODER can also be fine-tuned for specific tasks. Code and models are available at https://github.com/GanjinZero/CODER.
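Embedding-based term normalization, as mentioned above, reduces to a nearest-neighbor lookup. The sketch below is a minimal, hypothetical illustration: every dictionary term is embedded once, and a free-text mention is mapped to the concept whose term embedding is nearest by cosine similarity. Toy vectors and the concept ID list stand in for real CODER embeddings and a real UMLS dictionary (in practice one would encode terms with the released model).

```python
import numpy as np

def normalize_rows(m):
    # L2-normalize each row so that dot products equal cosine similarities.
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# Hypothetical dictionary: concept IDs and a toy embedding per term.
concepts = ["C0027051",  # myocardial infarction
            "C0016658"]  # fracture
dictionary = normalize_rows(np.array([[0.9, 0.1, 0.0],
                                      [0.0, 0.2, 0.9]]))

def normalize_term(query_vec):
    """Return the concept ID of the dictionary term nearest to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    return concepts[int(np.argmax(dictionary @ q))]

mention = np.array([0.85, 0.15, 0.05])  # toy embedding of "heart attack"
result = normalize_term(mention)        # maps to the myocardial infarction concept
```

Because lookup is a single matrix-vector product over normalized embeddings, it scales to large dictionaries with standard approximate nearest-neighbor indexes.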