ARCH: Large-scale knowledge graph via aggregated narrative codified health records analysis.

Authors

Gan Ziming, Zhou Doudou, Rush Everett, Panickan Vidul A, Ho Yuk-Lam, Ostrouchov George, Xu Zhiwei, Shen Shuting, Xiong Xin, Greco Kimberly F, Hong Chuan, Bonzel Clara-Lea, Wen Jun, Costa Lauren, Cai Tianrun, Begoli Edmon, Xia Zongqi, Gaziano J Michael, Liao Katherine P, Cho Kelly, Cai Tianxi, Lu Junwei

Affiliations

Department of Statistics, University of Chicago, 5801 S Ellis Ave, Chicago, 60615, IL, USA.

Department of Statistics and Data Science, National University of Singapore, 117546, Singapore.

Publication information

J Biomed Inform. 2025 Feb;162:104761. doi: 10.1016/j.jbi.2024.104761. Epub 2025 Jan 23.

DOI: 10.1016/j.jbi.2024.104761
PMID: 39863245
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12066163/
Abstract

OBJECTIVE

Electronic health record (EHR) systems contain a wealth of clinical data stored both as codified data and as free-text narrative notes, the latter handled with natural language processing (NLP). The complexity of EHR data presents challenges in feature representation, information extraction, and uncertainty quantification. To address these challenges, we propose Aggregated naRrative Codified Health (ARCH) records analysis, an efficient method that generates a large-scale knowledge graph (KG) over a comprehensive set of codified and narrative EHR features.

METHODS

Using data from 12.5 million Veterans Affairs patients, ARCH first derives embedding vectors, then computes pairwise similarities together with associated p-values, so that the strength of relatedness between clinical features is measured with statistical certainty quantification. Next, ARCH performs a sparse embedding regression to remove indirect links between features and build a sparse KG. Finally, ARCH is validated on several clinical tasks: detecting known relationships between entity pairs, predicting drug side effects, disease phenotyping, and sub-typing Alzheimer's disease patients.
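The pipeline described in the Methods can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how the embeddings are derived, so the code assumes a shifted positive PMI matrix factorized by eigendecomposition, uses a plain proximal-gradient lasso as a stand-in for the sparse embedding regression, and omits the p-value step entirely. All function names, the penalty value, and the co-occurrence counts are hypothetical.

```python
import numpy as np

def sppmi_embeddings(cooc, dim, shift=1.0):
    """Embed features from a symmetric co-occurrence matrix.

    Assumption: shifted positive PMI followed by eigendecomposition;
    the abstract does not name the embedding method.
    """
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)
    col = cooc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc * total / (row * col))
    sppmi = np.maximum(pmi - np.log(shift), 0.0)
    sppmi[~np.isfinite(sppmi)] = 0.0
    # Keep the `dim` largest eigenpairs of the symmetric SPPMI matrix.
    vals, vecs = np.linalg.eigh(sppmi)
    top = np.argsort(vals)[::-1][:dim]
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))

def cosine_similarity(emb):
    """Pairwise cosine similarity between embedding rows."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize 0.5*||y - X b||^2 + lam*||b||_1 by proximal gradient (ISTA)."""
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 + 1e-12)
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = b - step * (X.T @ (X @ b - y))
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return b

def sparse_neighbors(emb, target, lam):
    """Regress one feature's embedding on all others; nonzero weights are KG edges."""
    others = [i for i in range(emb.shape[0]) if i != target]
    coef = lasso_ista(emb[others].T, emb[target], lam)
    return {others[k]: float(c) for k, c in enumerate(coef) if abs(c) > 1e-8}

# Hypothetical counts: features 0-1 and features 2-3 form two clusters.
toy_cooc = np.array([
    [10, 8, 1, 0],
    [8, 12, 0, 1],
    [1, 0, 9, 7],
    [0, 1, 7, 11],
])
toy_emb = sppmi_embeddings(toy_cooc, dim=2)
toy_sim = cosine_similarity(toy_emb)
toy_edges = sparse_neighbors(toy_emb, target=0, lam=0.05)
```

In this toy setting the similarity matrix scores every pair, while the L1 penalty keeps only the features that help reconstruct the target's embedding directly, which is the sense in which a sparse regression can prune indirect links.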

RESULTS

ARCH produces high-quality clinical embeddings and a KG for over 60,000 codified and narrative EHR concepts. The KG and embeddings are visualized in an R Shiny-powered web API. ARCH achieved high accuracy in detecting EHR concept relationships, with AUCs of 0.926 (codified) and 0.861 (NLP) for similar EHR concepts, and 0.810 (codified) and 0.843 (NLP) for related pairs. It detected drug side effects with an AUC of 0.723, which improved to 0.826 after fine-tuning. Using codified and NLP features together significantly increased detection power. ARCH is more accurate than the other methods compared and improves the performance of weakly supervised phenotyping algorithms. Notably, it separated Alzheimer's disease patients into two subgroups with different mortality rates.
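The AUC figures above summarize how well a similarity score ranks known related pairs above unrelated ones. As a minimal sketch of that metric only (the function name and the scores below are hypothetical, and the paper's own evaluation protocol is not described in the abstract), AUC can be computed as the fraction of positive/negative pair comparisons that the score orders correctly, with ties counted as half:

```python
import numpy as np

def auc_from_scores(scores, labels):
    """Rank-based AUC: P(score of a related pair > score of an unrelated pair)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative pair")
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))

# Hypothetical cosine similarities for four concept pairs;
# label 1 marks a pair known to be related, 0 a random pair.
example_auc = auc_from_scores([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading the reported values.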

CONCLUSION

The proposed ARCH algorithm generates large-scale, high-quality semantic representations and a knowledge graph for both codified and NLP EHR features, which are useful for a wide range of predictive modeling tasks.
