ARCH: Large-scale Knowledge Graph via Aggregated Narrative Codified Health Records Analysis

Gan Ziming, Zhou Doudou, Rush Everett, Panickan Vidul A, Ho Yuk-Lam, Ostrouchov George, Xu Zhiwei, Shen Shuting, Xiong Xin, Greco Kimberly F, Hong Chuan, Bonzel Clara-Lea, Wen Jun, Costa Lauren, Cai Tianrun, Begoli Edmon, Xia Zongqi, Gaziano J Michael, Liao Katherine P, Cho Kelly, Cai Tianxi, Lu Junwei

University of Chicago, Chicago, IL, USA.

Harvard T.H. Chan School of Public Health, Boston, MA, USA.

medRxiv. 2023 May 21:2023.05.14.23289955. doi: 10.1101/2023.05.14.23289955.

OBJECTIVE

Electronic health record (EHR) systems contain a wealth of clinical data stored as both codified data and free-text narrative notes, covering hundreds of thousands of clinical concepts available for research and clinical care. The complex, massive, heterogeneous, and noisy nature of EHR data poses significant challenges for feature representation, information extraction, and uncertainty quantification. To address these challenges, we proposed an efficient Aggregated naRrative Codified Health (ARCH) records analysis to generate a large-scale knowledge graph (KG) for a comprehensive set of EHR codified and narrative features.

METHODS

The ARCH algorithm first derives embedding vectors from a co-occurrence matrix of all EHR concepts and then generates cosine similarities, along with associated p-values, to measure the strength of relatedness between clinical features while quantifying statistical uncertainty. In the final step, ARCH performs a sparse embedding regression to remove indirect linkage between entity pairs. We validated the clinical utility of the ARCH knowledge graph, generated from 12.5 million patients in the Veterans Affairs (VA) healthcare system, through downstream tasks including detecting known relationships between entity pairs, predicting drug side effects, disease phenotyping, and sub-typing Alzheimer's disease patients.
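As a rough illustration of the first two steps (embeddings from a co-occurrence matrix, then cosine similarities), the sketch below uses a positive pointwise mutual information (PPMI) transform followed by a truncated SVD. That factorization is an assumption made for illustration; the abstract does not specify how ARCH derives its embeddings. The toy counts, the function names `ppmi`, `embed`, and `cosine_matrix`, and the dimension `dim=2` are all hypothetical. The p-value computation and the sparse embedding regression are not sketched here.

```python
import numpy as np

def ppmi(cooc: np.ndarray) -> np.ndarray:
    """Positive pointwise mutual information of a co-occurrence count matrix."""
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)
    col = cooc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0      # zero counts carry no information
    return np.maximum(pmi, 0.0)       # keep only positive associations

def embed(cooc: np.ndarray, dim: int) -> np.ndarray:
    """Assumed embedding step: truncated SVD of the PPMI matrix."""
    u, s, _ = np.linalg.svd(ppmi(cooc))
    return u[:, :dim] * np.sqrt(s[:dim])

def cosine_matrix(emb: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of emb."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T

# Hypothetical counts for four concepts (diagonal = self co-occurrence).
# Concepts 0 and 1 co-occur often, as do concepts 2 and 3.
cooc = np.array([
    [10.0, 8.0, 1.0, 1.0],
    [8.0, 10.0, 1.0, 1.0],
    [1.0, 1.0, 10.0, 7.0],
    [1.0, 1.0, 7.0, 10.0],
])

emb = embed(cooc, dim=2)
sim = cosine_matrix(emb)
print(np.round(sim, 3))
```

In this toy example, concepts 0 and 1 end up with a much higher cosine similarity than concepts 0 and 2, which is the kind of relatedness signal the cosine step is meant to capture.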

RESULTS

ARCH produces high-quality clinical embeddings and a KG for over 60,000 EHR concepts, as visualized in the R Shiny-powered web API (https://celehs.hms.harvard.edu/ARCH/). The ARCH embeddings attained an average area under the ROC curve (AUC) of 0.926 (codified) and 0.861 (NLP) for detecting pairs of similar EHR concepts, and 0.810 (codified) and 0.843 (NLP) for detecting related pairs. Based on the p-values computed by ARCH, the sensitivities for detecting similar and related entity pairs are 0.906 and 0.888, respectively, under false discovery rate (FDR) control at 5%. For detecting drug side effects, the cosine similarity based on the ARCH semantic representations achieved an AUC of 0.723, which improved to 0.826 after few-shot training that minimizes the loss function on the training data set. Incorporating NLP data substantially improved the ability to detect side effects in the EHR. For example, based on unsupervised ARCH embeddings, the power to detect drug-side effect pairs was 0.15 when using codified data only, much lower than the power of 0.51 when using both codified and NLP concepts. Compared with existing large-scale representation learning methods, including PubMedBERT, BioBERT, and SapBERT, ARCH attains the most robust performance and substantially higher accuracy in detecting these relationships. Incorporating ARCH-selected features in weakly supervised phenotyping algorithms can improve the robustness of algorithm performance, especially for diseases that benefit from NLP features as supporting evidence. For example, the phenotyping algorithm for depression attained an AUC of 0.927 when using ARCH-selected features but only 0.857 when using codified features selected via the KESER network[1]. In addition, embeddings and knowledge graphs generated from the ARCH network were able to cluster Alzheimer's disease (AD) patients into two subgroups, where the fast-progression subgroup had a much higher mortality rate.
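To make the phrase "sensitivity under FDR control of 5%" concrete, the sketch below selects entity pairs from a list of p-values with the Benjamini-Hochberg step-up procedure and then computes sensitivity against known labels. Benjamini-Hochberg is assumed here as one standard way to control the FDR; the abstract does not state which procedure ARCH uses. The p-values, labels, and function names are hypothetical.

```python
import numpy as np

def benjamini_hochberg(pvalues, fdr=0.05):
    """Return a boolean mask of hypotheses rejected at the given FDR level."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = fdr * np.arange(1, m + 1) / m   # step-up thresholds i * q / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()          # largest rank that passes
        reject[order[: k + 1]] = True            # reject everything up to it
    return reject

def sensitivity(reject, is_true_pair):
    """Fraction of truly related pairs that were selected."""
    truth = np.asarray(is_true_pair, dtype=bool)
    return (reject & truth).sum() / truth.sum()

# Hypothetical p-values for five candidate pairs; the first four are
# truly related, the last one is not.
pvals = [0.001, 0.004, 0.02, 0.2, 0.6]
truth = [True, True, True, True, False]

selected = benjamini_hochberg(pvals, fdr=0.05)
sens = sensitivity(selected, truth)
print(selected.tolist(), sens)
```

Here three of the four truly related pairs are selected, so the sensitivity is 0.75 while no unrelated pair is picked up.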

CONCLUSIONS

The proposed ARCH algorithm generates large-scale, high-quality semantic representations and a knowledge graph for both codified and NLP EHR features, which are useful for a wide range of predictive modeling tasks.
