Lee Junghwan, Liu Cong, Kim Jae Hyun, Butler Alex, Shang Ning, Pang Chao, Natarajan Karthik, Ryan Patrick, Ta Casey, Weng Chunhua
Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York 10032, USA.
JAMIA Open. 2021 Jun 16;4(2):ooab028. doi: 10.1093/jamiaopen/ooab028. eCollection 2021 Apr.
Feature engineering is a major bottleneck in phenotyping. Properly learned medical concept embeddings (MCEs) capture the semantics of medical concepts and are therefore useful for retrieving relevant medical features in phenotyping tasks. We compared the effectiveness of MCEs learned from knowledge graphs with that of MCEs learned from electronic health record (EHR) data in retrieving relevant medical features for phenotyping tasks.
We implemented 5 embedding methods (node2vec, singular value decomposition [SVD], LINE, skip-gram, and GloVe) with 2 data sources: (1) knowledge graphs obtained from the Observational Medical Outcomes Partnership (OMOP) Common Data Model; and (2) patient-level data obtained from the OMOP-compatible EHR of Columbia University Irving Medical Center (CUIMC). We used phenotypes, together with their relevant concepts, developed and validated by the Electronic Medical Records and Genomics (eMERGE) network to evaluate the learned MCEs. Performance was measured as the ability to retrieve phenotype-relevant concepts starting from a single seed concept and from multiple seed concepts.
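The seed-based retrieval described above can be illustrated with a minimal sketch. It assumes cosine similarity as the ranking function and an average of the seed vectors when several seeds are given; the abstract confirms neither choice. The concept names, the 2-dimensional vectors, and the helper names `rank_concepts` and `hit_rate` are hypothetical, and `hit_rate` is a simple recall-style fraction, not necessarily the metric used in the study.

```python
import numpy as np

# Hypothetical toy embeddings; real MCEs would have many more dimensions.
toy = {
    "type 2 diabetes": [1.0, 0.1],
    "metformin": [0.9, 0.2],
    "hemoglobin A1c": [0.8, 0.3],
    "asthma": [0.1, 1.0],
    "albuterol": [0.2, 0.9],
}

def rank_concepts(embeddings, seeds, top_k=3):
    """Return the top_k non-seed concepts by cosine similarity to the seeds.

    Assumption: multiple seeds are combined by averaging their
    unit-normalized vectors.
    """
    names = list(embeddings)
    matrix = np.array([embeddings[n] for n in names], dtype=float)
    # Normalize rows so that a dot product equals cosine similarity.
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    query = np.mean([matrix[names.index(s)] for s in seeds], axis=0)
    scores = matrix @ query
    order = np.argsort(-scores)  # indices sorted by descending score
    ranked = [names[i] for i in order if names[i] not in seeds]
    return ranked[:top_k]

def hit_rate(retrieved, relevant):
    """Fraction of the relevant concepts that appear in the retrieved list."""
    relevant = set(relevant)
    return len(relevant.intersection(retrieved)) / len(relevant)

single_seed = rank_concepts(toy, ["type 2 diabetes"], top_k=2)
multi_seed = rank_concepts(toy, ["type 2 diabetes", "metformin"], top_k=1)
```

With a phenotype's curated concept list as `relevant`, `hit_rate(single_seed, relevant)` gives one number per embedding method, which is the kind of comparison the evaluation relies on.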
Among all MCEs, those learned by using node2vec with knowledge graphs showed the best performance. Within each data source, node2vec outperformed the other methods on knowledge graphs, and GloVe outperformed the other methods on EHR data.
MCEs enable scalable feature engineering, thereby facilitating phenotyping. Based on current phenotyping practices, MCEs learned from knowledge graphs constructed from hierarchical relationships among medical concepts outperformed MCEs learned from EHR data.