ARCH: Large-scale knowledge graph via aggregated narrative codified health records analysis.

Authors

Gan Ziming, Zhou Doudou, Rush Everett, Panickan Vidul A, Ho Yuk-Lam, Ostrouchov George, Xu Zhiwei, Shen Shuting, Xiong Xin, Greco Kimberly F, Hong Chuan, Bonzel Clara-Lea, Wen Jun, Costa Lauren, Cai Tianrun, Begoli Edmon, Xia Zongqi, Gaziano J Michael, Liao Katherine P, Cho Kelly, Cai Tianxi, Lu Junwei

Affiliations

Department of Statistics, University of Chicago, 5801 S Ellis Ave, Chicago, 60615, IL, USA.

Department of Statistics and Data Science, National University of Singapore, 117546, Singapore.

Publication information

J Biomed Inform. 2025 Feb;162:104761. doi: 10.1016/j.jbi.2024.104761. Epub 2025 Jan 23.

Abstract

OBJECTIVE

Electronic health record (EHR) systems contain a wealth of clinical data stored as both codified data and free-text narrative notes, the latter typically mined with natural language processing (NLP). The complexity of EHR data presents challenges in feature representation, information extraction, and uncertainty quantification. To address these challenges, we propose Aggregated naRrative Codified Health (ARCH) records analysis, an efficient method for generating a large-scale knowledge graph (KG) over a comprehensive set of codified and narrative EHR features.

METHODS

Using data from 12.5 million Veterans Affairs patients, ARCH first derives embedding vectors and generates similarities along with associated p-values to measure the strength of relatedness between clinical features with statistical certainty quantification. Next, ARCH performs a sparse embedding regression to remove indirect linkage between features to build a sparse KG. Finally, ARCH was validated on various clinical tasks, including detecting known relationships between entity pairs, predicting drug side effects, disease phenotyping, as well as sub-typing Alzheimer's disease patients.
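The two middle steps (similarity between embeddings, then a sparse regression that drops indirect links) can be illustrated with a small sketch. This is not the authors' implementation: the embeddings are made up, the feature names `drug_A`, `disease_B`, and `lab_C` are hypothetical, the sparse regression is assumed here to be a plain lasso solved by coordinate descent, and the p-value step is omitted entirely.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def soft_threshold(z, lam):
    """Shrink z toward zero by lam (the lasso proximal operator)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def sparse_regression(target, predictors, lam=0.2, n_iter=100):
    """Lasso by coordinate descent: regress one embedding on the others.

    Minimizes 0.5 * ||target - sum_k beta_k * x_k||^2 + lam * ||beta||_1.
    The penalty lam is an illustrative choice, not a value from the paper.
    """
    names = list(predictors)
    beta = {name: 0.0 for name in names}
    for _ in range(n_iter):
        for name in names:
            x = predictors[name]
            # Partial residual: target minus the fit from every other predictor.
            resid = list(target)
            for other in names:
                if other != name:
                    resid = [r - beta[other] * xo
                             for r, xo in zip(resid, predictors[other])]
            rho = sum(xi * ri for xi, ri in zip(x, resid))
            beta[name] = soft_threshold(rho, lam) / sum(xi * xi for xi in x)
    return beta

# Hypothetical 4-dimensional embeddings, made up for illustration only.
embeddings = {
    "drug_A": [1.0, 1.0, 0.0, 1.0],
    "disease_B": [1.0, 1.0, 0.0, 0.0],
    "lab_C": [1.0, 0.0, 1.0, 0.0],
}

# Marginal similarity: drug_A and lab_C look related through disease_B.
marginal = cosine(embeddings["drug_A"], embeddings["lab_C"])

# Sparse regression of drug_A on all other features; non-zero coefficients are edges.
others = {k: v for k, v in embeddings.items() if k != "drug_A"}
beta = sparse_regression(embeddings["drug_A"], others)
edges = [name for name, b in beta.items() if abs(b) > 1e-8]

print(f"cosine(drug_A, lab_C) = {marginal:.3f}")
print("edges kept for drug_A:", edges)
```

In this toy setup `lab_C` has a non-trivial cosine similarity with `drug_A` (about 0.41), yet its regression coefficient is zero once `disease_B` is in the model, so only the direct `drug_A`-`disease_B` edge survives.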

RESULTS

ARCH produces high-quality clinical embeddings and KG for over 60,000 codified and narrative EHR concepts. The KG and embeddings are visualized in the R-shiny powered web-API. ARCH achieved high accuracy in detecting EHR concept relationships, with AUCs of 0.926 (codified) and 0.861 (NLP) for similar EHR concepts, and 0.810 (codified) and 0.843 (NLP) for related pairs. It detected drug side effects with a 0.723 AUC, which improved to 0.826 after fine-tuning. Using both codified and NLP features, the detection power increased significantly. Compared to other methods, ARCH has superior accuracy and enhances weakly supervised phenotyping algorithms' performance. Notably, it successfully categorized Alzheimer's patients into two subgroups with varying mortality rates.
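For readers unfamiliar with how an AUC is obtained from embedding similarities, the sketch below computes it as the probability that a known related pair scores higher than a random pair. The score lists are hypothetical and are not data from the study; the function name `auc` is likewise only illustrative.

```python
def auc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation of AUC).
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical cosine similarities, made up for illustration only.
known_pairs = [0.91, 0.78, 0.66, 0.40]   # pairs with a curated relationship
random_pairs = [0.55, 0.35, 0.20, 0.10]  # randomly sampled pairs

result = auc(known_pairs, random_pairs)
print(result)  # 0.9375: 15 of the 16 positive-negative comparisons are won
```

An AUC of 0.5 would mean the similarity score cannot separate known pairs from random ones, and 1.0 would mean perfect separation.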

CONCLUSION

The proposed ARCH algorithm generates large-scale, high-quality semantic representations and a knowledge graph for both codified and NLP EHR features, which are useful for a wide range of predictive modeling tasks.
