Chang David, Balažević Ivana, Allen Carl, Chawla Daniel, Brandt Cynthia, Taylor Richard Andrew
Yale Center for Medical Informatics, Yale University.
School of Informatics, University of Edinburgh, UK.
Proc Conf Assoc Comput Linguist Meet. 2020 Jul;2020:167-176. doi: 10.18653/v1/2020.bionlp-1.18.
Much of biomedical and healthcare data is encoded in discrete, symbolic form such as text and medical codes. There is a wealth of expert-curated biomedical domain knowledge stored in knowledge bases and ontologies, but the lack of reliable methods for learning knowledge representations has limited their usefulness in machine learning applications. While text-based representation learning has improved significantly in recent years through advances in natural language processing, attempts to learn biomedical concept embeddings have so far been lacking. A recent family of models called knowledge graph embeddings has shown promising results on general-domain knowledge graphs, and we explore their capabilities in the biomedical domain. We train several state-of-the-art knowledge graph embedding models on the SNOMED-CT knowledge graph, provide a benchmark with comparison to existing methods and an in-depth discussion of best practices, and make a case for the importance of leveraging the multi-relational nature of knowledge graphs for learning biomedical knowledge representations. The embeddings, code, and materials will be made available to the community.
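As a rough illustration of what a knowledge graph embedding model computes, the sketch below scores (head, relation, tail) triples with a TransE-style translation score. This is an assumption made for illustration only: the abstract does not name the specific models trained, and the concept names, relation names, and two-dimensional vectors here are hypothetical toy values, not SNOMED-CT data or learned embeddings.

```python
import math

# Toy 2-d embeddings. All names and values are hypothetical and hand-set;
# a real model would learn them from the knowledge graph's triples.
entity_emb = {
    "pneumonia": [0.0, 0.0],
    "lung": [1.0, 0.0],
    "infectious_disease": [0.0, 1.0],
}
relation_emb = {
    "finding_site": [1.0, 0.0],
    "is_a": [0.0, 1.0],
}


def transe_score(head, relation, tail):
    """TransE-style score: negative Euclidean distance between
    (head + relation) and tail. Higher means more plausible."""
    h = entity_emb[head]
    r = relation_emb[relation]
    t = entity_emb[tail]
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


# The same head entity is scored under two different relation types.
plausible = transe_score("pneumonia", "finding_site", "lung")
implausible = transe_score("pneumonia", "finding_site", "infectious_disease")

print(plausible, implausible)
```

Because each relation type has its own vector, the same pair of concepts can score differently under different relations, which is one simple way to see what using the multi-relational structure of a graph means in practice.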