Real-world data medical knowledge graph: construction and applications.

Author information

Institute of Information Science, Beijing Jiaotong University, Beijing, China; Yidu Cloud Technology Inc., Beijing, China.

College of Computer Science, Chongqing University, Chongqing, China; Southwest Hospital, Chongqing, China.

Publication information

Artif Intell Med. 2020 Mar;103:101817. doi: 10.1016/j.artmed.2020.101817. Epub 2020 Feb 6.

DOI:10.1016/j.artmed.2020.101817

PMID:32143785

Abstract

OBJECTIVE

Medical knowledge graphs (KGs) are attracting attention from both academia and the healthcare industry because of their power in intelligent healthcare applications. In this paper, we introduce a systematic approach to building a medical KG from electronic medical records (EMRs), evaluated through both technical experiments and end-to-end application examples.

MATERIALS AND METHODS

The original data set contains 16,217,270 de-identified clinical visit records from 3,767,198 patients. The KG construction procedure consists of eight steps: data preparation, entity recognition, entity normalization, relation extraction, property calculation, graph cleaning, related-entity ranking, and graph embedding. We propose a novel quadruplet structure to represent medical knowledge in place of the classical KG triplet. We also propose a novel related-entity ranking function that considers probability, specificity and reliability (PSR). In addition, the probabilistic translation on hyperplanes (PrTransH) algorithm is used to learn graph embeddings for the generated KG.
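
The abstract names PrTransH but does not give its scoring function. As a rough illustration of the underlying idea only, the sketch below implements the score of plain TransH (translation on hyperplanes), not the probabilistic variant used in the paper. The function names and the toy vectors are hypothetical; the squared L2 distance is one common choice, assumed here.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def project(v, w_unit):
    """Project v onto the hyperplane whose unit normal is w_unit."""
    scale = dot(v, w_unit)
    return [x - scale * y for x, y in zip(v, w_unit)]

def transh_score(head, tail, normal, translation):
    """Plain TransH dissimilarity (assumed squared L2 form).

    Lower scores mean the (head, relation, tail) fact fits better.
    `normal` and `translation` are the relation-specific parameters.
    """
    norm = math.sqrt(dot(normal, normal))
    w_unit = [x / norm for x in normal]  # enforce a unit normal
    h_p = project(head, w_unit)
    t_p = project(tail, w_unit)
    diff = [a + b - c for a, b, c in zip(h_p, translation, t_p)]
    return dot(diff, diff)

# Toy 3-dimensional example: after projection, head + translation == tail.
good = transh_score([1, 0, 2], [2, 0, 5], [0, 0, 1], [1, 0, 0])
bad = transh_score([1, 0, 2], [0, 1, 7], [0, 0, 1], [1, 0, 0])
```

In this toy case the first pair scores lower than the second, which is the ordering a trained model is meant to produce for true versus false facts.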

RESULTS

A medical KG with 9 entity types, including disease and symptom, was established; it contains 22,508 entities and 579,094 quadruplets. Compared with the term frequency-inverse document frequency (TF/IDF) method, the proposed ranking function increased the normalized discounted cumulative gain (NDCG@10) from 0.799 to 0.906. Embedding representations for all entities and relations were learned and shown to be effective through disease clustering.
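
For readers unfamiliar with the metric, the following is a minimal sketch of how NDCG@10 can be computed for one ranked list. It assumes the linear-gain form of DCG with a log2 discount and takes the ideal ordering from the grades in the same list; the abstract does not say which variant the authors used, and the function names are hypothetical.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(relevances, k=10):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant items at all
    return dcg_at_k(relevances, k) / ideal

# Relevance grades of related entities in the order a ranker returned them.
perfect = ndcg_at_k([3, 2, 1])
reversed_order = ndcg_at_k([1, 2, 3])
```

A perfectly ordered list scores 1.0, and any misordering of graded items lowers the score.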

CONCLUSION

The established systematic procedure can efficiently construct a high-quality medical KG from large-scale EMRs. The proposed ranking function PSR achieves the best performance under all relations, and the disease clustering result validates the efficacy of the learned embedding vectors as semantic representations of entities. Moreover, the resulting KG has found many successful applications owing to its statistics-based quadruplets.

In the definition of the reliability value, N is a minimum co-occurrence number and R is the basic reliability value. The reliability value measures how reliable the relationship between S and O is. The rationale is that the higher the co-occurrence number N(S, O), the more reliable the relationship. However, when two relationships both have very large co-occurrence numbers, their reliability values should not differ much. In our study, we set N = 10 and R = 1 after some experiments. For instance, if the co-occurrence numbers of three relationships are 1, 100 and 10000, their reliability values are 1, 2.96 and 5, respectively.
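
The equation that defines the reliability value is not included in this text. The sketch below uses one functional form, a base value plus a base-10 logarithm of the co-occurrence count above the minimum, purely as an assumption: it was chosen because it reproduces the three example values quoted above with N = 10 and R = 1. The paper's actual formula may differ, and the function and parameter names are hypothetical.

```python
import math

def reliability(co_occurrence, n_min=10, r_base=1.0):
    """Assumed reliability form (not confirmed by the source).

    Below the minimum co-occurrence number the value stays at r_base;
    above it, the value grows logarithmically, so very large counts
    end up with similar reliability values.
    """
    if co_occurrence < n_min:
        return r_base
    return r_base + math.log10(co_occurrence - n_min + 1)

# Co-occurrence numbers from the worked example in the text.
values = [round(reliability(n), 2) for n in (1, 100, 10000)]
```

Under this assumed form, increasing the co-occurrence number a hundredfold, from 100 to 10000, adds only about 2 to the reliability value, which matches the stated intent that large counts should not produce very different values.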
