Gan Ziming, Zhou Doudou, Rush Everett, Panickan Vidul A, Ho Yuk-Lam, Ostrouchov George, Xu Zhiwei, Shen Shuting, Xiong Xin, Greco Kimberly F, Hong Chuan, Bonzel Clara-Lea, Wen Jun, Costa Lauren, Cai Tianrun, Begoli Edmon, Xia Zongqi, Gaziano J Michael, Liao Katherine P, Cho Kelly, Cai Tianxi, Lu Junwei
Department of Statistics, University of Chicago, 5801 S Ellis Ave, Chicago, 60615, IL, USA.
Department of Statistics and Data Science, National University of Singapore, 117546, Singapore.
J Biomed Inform. 2025 Feb;162:104761. doi: 10.1016/j.jbi.2024.104761. Epub 2025 Jan 23.
Electronic health record (EHR) systems contain a wealth of clinical data stored both as codified data and as free-text narrative notes, the latter extracted with natural language processing (NLP). The complexity of EHR data presents challenges in feature representation, information extraction, and uncertainty quantification. To address these challenges, we propose Aggregated naRrative Codified Health (ARCH) records analysis, an efficient method for generating a large-scale knowledge graph (KG) over a comprehensive set of codified and narrative EHR features.
Using data from 12.5 million Veterans Affairs patients, ARCH first derives embedding vectors for clinical features and computes pairwise similarities with associated p-values, so that the strength of relatedness between features is reported together with a measure of statistical uncertainty. Next, ARCH performs a sparse embedding regression to remove indirect links between features, yielding a sparse KG. Finally, ARCH was validated on several clinical tasks: detecting known relationships between entity pairs, predicting drug side effects, disease phenotyping, and subtyping Alzheimer's disease patients.
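The two steps described above (similarity between embeddings, then sparse regression to drop indirect links) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the estimator, so a plain lasso solved by coordinate descent is swapped in as a stand-in, and the three-dimensional vectors, the concept names, and the penalty `lam` are hypothetical toy values chosen only to show the effect.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values within [-lam, lam] become 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def sparse_embedding_regression(target, others, lam, n_iter=200):
    """Lasso by coordinate descent (an assumed stand-in, not ARCH itself).

    Minimizes 0.5 * ||target - sum_j beta_j * others[j]||^2 + lam * sum_j |beta_j|.
    A coefficient of exactly 0 means no direct edge is kept to that concept.
    """
    beta = {name: 0.0 for name in others}
    residual = list(target)
    for _ in range(n_iter):
        for name, x in others.items():
            norm_sq = sum(a * a for a in x)
            if norm_sq == 0.0:
                continue
            # Inner product of x with the residual that excludes x's own term.
            rho = sum(a * (r + beta[name] * a) for a, r in zip(x, residual))
            new_beta = soft_threshold(rho, lam) / norm_sq
            delta = new_beta - beta[name]
            if delta != 0.0:
                residual = [r - delta * a for r, a in zip(residual, x)]
                beta[name] = new_beta
    return beta

# Hypothetical toy embeddings: the target is close to concept_a, and
# concept_b resembles the target only through its overlap with concept_a.
target = [0.9, 0.0, 0.4359]
others = {
    "concept_a": [1.0, 0.0, 0.0],
    "concept_b": [0.6, 0.8, 0.0],
}

sim_b = cosine(target, others["concept_b"])   # marginal similarity is sizeable
beta = sparse_embedding_regression(target, others, lam=0.05)
print(round(sim_b, 3), beta)
```

In this toy case concept_b has a marginal cosine similarity of about 0.54 with the target, yet its regression coefficient is shrunk to zero once concept_a is accounted for, which is the sense in which a sparse regression removes an indirect link.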
ARCH produces high-quality clinical embeddings and a KG for over 60,000 codified and narrative EHR concepts. The KG and embeddings are visualized in an R Shiny-powered web API. ARCH achieved high accuracy in detecting EHR concept relationships, with AUCs of 0.926 (codified) and 0.861 (NLP) for similar EHR concepts, and 0.810 (codified) and 0.843 (NLP) for related pairs. It detected drug side effects with an AUC of 0.723, which improved to 0.826 after fine-tuning. Detection power increased significantly when codified and NLP features were used together. Compared with other methods, ARCH has superior accuracy and improves the performance of weakly supervised phenotyping algorithms. Notably, it categorized Alzheimer's disease patients into two subgroups with different mortality rates.
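For readers less familiar with the AUC figures reported above, the following sketch shows how an AUC can be computed for a relationship-detection task: it is the probability that a randomly chosen known-related pair receives a higher similarity score than a randomly chosen unrelated pair, with ties counted as half. The scores below are hypothetical toy values, not data from the study.

```python
def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties contribute 0.5, which matches the Mann-Whitney U formulation.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical similarity scores for concept pairs.
related_pairs = [0.9, 0.8, 0.4]     # pairs with a known relationship
unrelated_pairs = [0.7, 0.3, 0.2]   # randomly sampled pairs

value = auc(related_pairs, unrelated_pairs)
print(round(value, 3))
```

Here 8 of the 9 positive-negative comparisons are ordered correctly, so the toy AUC is 8/9.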
The proposed ARCH algorithm generates large-scale, high-quality semantic representations and a knowledge graph for both codified and NLP EHR features, which are useful for a wide range of predictive modeling tasks.