ARCH: Large-scale Knowledge Graph via Aggregated Narrative Codified Health Records Analysis.

Authors

Gan Ziming, Zhou Doudou, Rush Everett, Panickan Vidul A, Ho Yuk-Lam, Ostrouchov George, Xu Zhiwei, Shen Shuting, Xiong Xin, Greco Kimberly F, Hong Chuan, Bonzel Clara-Lea, Wen Jun, Costa Lauren, Cai Tianrun, Begoli Edmon, Xia Zongqi, Gaziano J Michael, Liao Katherine P, Cho Kelly, Cai Tianxi, Lu Junwei

Affiliations

University of Chicago, Chicago, IL, USA.

Harvard T.H. Chan School of Public Health, Boston, MA, USA.

Publication information

medRxiv. 2023 May 21:2023.05.14.23289955. doi: 10.1101/2023.05.14.23289955.


DOI:10.1101/2023.05.14.23289955
PMID:37293026
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10246054/
Abstract

OBJECTIVE: Electronic health record (EHR) systems contain a wealth of clinical data stored as both codified data and free-text narrative notes, covering hundreds of thousands of clinical concepts available for research and clinical care. The complex, massive, heterogeneous, and noisy nature of EHR data imposes significant challenges for feature representation, information extraction, and uncertainty quantification. To address these challenges, we proposed an efficient Aggregated naRrative Codified Health (ARCH) records analysis to generate a large-scale knowledge graph (KG) for a comprehensive set of EHR codified and narrative features.

METHODS: The ARCH algorithm first derives embedding vectors from a co-occurrence matrix of all EHR concepts and then generates cosine similarities along with associated p-values to measure the strength of relatedness between clinical features with statistical certainty quantification. In the final step, ARCH performs a sparse embedding regression to remove indirect linkage between entity pairs. We validated the clinical utility of the ARCH knowledge graph, generated from 12.5 million patients in the Veterans Affairs (VA) healthcare system, through downstream tasks including detecting known relationships between entity pairs, predicting drug side effects, disease phenotyping, and sub-typing Alzheimer's disease (AD) patients.

RESULTS: ARCH produces high-quality clinical embeddings and a KG for over 60,000 EHR concepts, as visualized in the R-shiny powered web-API (https://celehs.hms.harvard.edu/ARCH/). The ARCH embeddings attained an average area under the ROC curve (AUC) of 0.926 and 0.861 for detecting pairs of similar EHR concepts when the concepts are mapped to codified data and to NLP data, respectively, and 0.810 (codified) and 0.843 (NLP) for detecting related pairs. Based on the p-values computed by ARCH, the sensitivities of detecting similar and related entity pairs are 0.906 and 0.888, respectively, under false discovery rate (FDR) control of 5%. For detecting drug side effects, the cosine similarity based on the ARCH semantic representations achieved an AUC of 0.723, which improved to 0.826 after few-shot training via minimizing the loss function on the training data set. Incorporating NLP data substantially improved the ability to detect side effects in the EHR. For example, based on unsupervised ARCH embeddings, the power of detecting drug-side effect pairs when using codified data only was 0.15, much lower than the power of 0.51 when using both codified and NLP concepts. Compared to existing large-scale representation learning methods including PubMedBERT, BioBERT and SapBERT, ARCH attains the most robust performance and substantially higher accuracy in detecting these relationships. Incorporating ARCH-selected features in weakly supervised phenotyping algorithms can improve the robustness of algorithm performance, especially for diseases that benefit from NLP features as supporting evidence. For example, the phenotyping algorithm for depression attained an AUC of 0.927 when using ARCH-selected features but only 0.857 when using codified features selected via the KESER network[1]. In addition, embeddings and knowledge graphs generated from the ARCH network were able to cluster AD patients into two subgroups, where the fast progression subgroup had a much higher mortality rate.

CONCLUSIONS: The proposed ARCH algorithm generates large-scale, high-quality semantic representations and a knowledge graph for both codified and NLP EHR features, useful for a wide range of predictive modeling tasks.
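The first two steps named in METHODS (embeddings from a co-occurrence matrix, then cosine similarities) can be illustrated with a small sketch. The abstract does not say how the co-occurrence matrix is factorized, so this example assumes a shifted positive PMI transform followed by an eigendecomposition, a common recipe for co-occurrence embeddings. The function names, the toy counts, and the four concept labels are hypothetical, and the p-value and sparse regression steps are not covered.

```python
import numpy as np

def sppmi(cooc: np.ndarray, shift: float = 1.0) -> np.ndarray:
    """Shifted positive PMI of a symmetric co-occurrence count matrix (assumed transform)."""
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)
    col = cooc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc * total / (row * col))
    pmi[~np.isfinite(pmi)] = -np.inf  # zero counts carry no positive association
    return np.maximum(pmi - np.log(shift), 0.0)

def embed(cooc: np.ndarray, dim: int) -> np.ndarray:
    """Rank-`dim` embeddings from the top eigenpairs of the SPPMI matrix."""
    vals, vecs = np.linalg.eigh(sppmi(cooc))      # SPPMI is symmetric
    top = np.argsort(vals)[::-1][:dim]            # indices of the largest eigenvalues
    scale = np.sqrt(np.clip(vals[top], 0.0, None))
    return vecs[:, top] * scale

def cosine_similarity(emb: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of `emb`."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T

# Hypothetical toy counts for four concepts:
# 0 = rheumatoid arthritis code, 1 = methotrexate,
# 2 = depression code,           3 = sertraline
cooc = np.array([
    [50, 30,  2,  1],
    [30, 40,  1,  2],
    [ 2,  1, 60, 35],
    [ 1,  2, 35, 45],
], dtype=float)

emb = embed(cooc, dim=2)
sim = cosine_similarity(emb)
print(np.round(sim, 3))
```

In this toy matrix the disease and drug that co-occur often end up with a much higher cosine similarity than pairs drawn from different clinical areas, which is the kind of relatedness signal the abstract describes.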

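RESULTS reports sensitivity "under false discovery rate (FDR) control of 5%" based on ARCH p-values. The abstract does not name the multiple-testing procedure, so the sketch below assumes the standard Benjamini-Hochberg step-up rule. The p-values and the set of "known" pairs are made-up illustrations, not data from the paper.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha: float = 0.05) -> np.ndarray:
    """Boolean mask of hypotheses rejected by the Benjamini-Hochberg step-up rule."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices of p-values, ascending
    thresholds = alpha * np.arange(1, m + 1) / m   # alpha * k / m for rank k
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])          # largest rank meeting its threshold
        reject[order[: k + 1]] = True              # reject every hypothesis up to that rank
    return reject

# Hypothetical p-values for six candidate concept pairs
p_values = [0.20, 0.001, 0.74, 0.008, 0.041, 0.039]
# Hypothetical labels: which pairs are known to be truly related
known_related = np.array([False, True, False, True, True, False])

reject = benjamini_hochberg(p_values, alpha=0.05)
# Sensitivity: fraction of known related pairs that were detected
sensitivity = (reject & known_related).sum() / known_related.sum()
print(reject, sensitivity)
```

Sensitivity here is the share of known related pairs that survive the FDR threshold, which matches how the 0.906 and 0.888 figures are described in the abstract.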

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f33/10246054/3c4df742532a/nihpp-2023.05.14.23289955v1-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f33/10246054/05f662ef50fa/nihpp-2023.05.14.23289955v1-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f33/10246054/17c103188892/nihpp-2023.05.14.23289955v1-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f33/10246054/115ad910f9f4/nihpp-2023.05.14.23289955v1-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f33/10246054/d3b555bba5ea/nihpp-2023.05.14.23289955v1-f0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f33/10246054/cf4e627afb11/nihpp-2023.05.14.23289955v1-f0006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f33/10246054/5ff7a82914c0/nihpp-2023.05.14.23289955v1-f0007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f33/10246054/0cf9eeb83c98/nihpp-2023.05.14.23289955v1-f0008.jpg

Similar articles

[1]
ARCH: Large-scale Knowledge Graph via Aggregated Narrative Codified Health Records Analysis.

medRxiv. 2023-5-21

[2]
ARCH: Large-scale knowledge graph via aggregated narrative codified health records analysis.

J Biomed Inform. 2025-2

[3]
Multiview Incomplete Knowledge Graph Integration with application to cross-institutional EHR data harmonization.

J Biomed Inform. 2022-9

[4]
HPO2Vec+: Leveraging heterogeneous knowledge resources to enrich node embeddings for the Human Phenotype Ontology.

J Biomed Inform. 2019-6-27

[5]
Clinical knowledge extraction via sparse embedding regression (KESER) with multi-center large scale electronic health record data.

NPJ Digit Med. 2021-10-27

[6]
Adverse Drug Event Prediction Using Noisy Literature-Derived Knowledge Graphs: Algorithm Development and Validation.

JMIR Med Inform. 2021-10-25

[7]
DOME: Directional medical embedding vectors from Electronic Health Records.

J Biomed Inform. 2025-2

[8]
Automated feature selection of predictors in electronic medical records data.

Biometrics. 2019-3

[9]
Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.

Cochrane Database Syst Rev. 2022-2-1

[10]
Incorporating Domain Knowledge Into Language Models by Using Graph Convolutional Networks for Assessing Semantic Textual Similarity: Model Development and Performance Comparison.

JMIR Med Inform. 2021-11-26
