Phe2vec: Automated disease phenotyping based on unsupervised embeddings from electronic health records.

Authors

De Freitas Jessica K, Johnson Kipp W, Golden Eddye, Nadkarni Girish N, Dudley Joel T, Bottinger Erwin P, Glicksberg Benjamin S, Miotto Riccardo

Affiliations

Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA.

Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA.

Publication information

Patterns (N Y). 2021 Sep 2;2(9):100337. doi: 10.1016/j.patter.2021.100337. eCollection 2021 Sep 10.

DOI:10.1016/j.patter.2021.100337

PMID:34553174

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8441576/

Abstract

Robust phenotyping of patients from electronic health records (EHRs) at scale is a challenge in clinical informatics. Here, we introduce Phe2vec, an automated framework for disease phenotyping from EHRs based on unsupervised learning, and assess its effectiveness against standard rule-based algorithms from the Phenotype KnowledgeBase (PheKB). Phe2vec is based on pre-computing embeddings of medical concepts and patients' clinical history. Disease phenotypes are then derived from a seed concept and its neighbors in the embedding space. Patients are linked to a disease if their embedded representation is close to the disease phenotype. Comparing Phe2vec and PheKB cohorts head-to-head using chart review, Phe2vec performed on par or better in nine out of ten diseases. Unlike other approaches, it can scale to any condition and was validated against widely adopted expert-based standards. Phe2vec aims to optimize clinical informatics research by augmenting current frameworks to characterize patients by condition and derive reliable disease cohorts.
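
The three steps named in the abstract (pre-computed concept embeddings, a phenotype built from a seed concept and its neighbors, and patients linked by closeness to that phenotype) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy 3-dimensional vectors, the concept names, the use of cosine similarity, the averaging of concept vectors to represent patients and phenotypes, and the values of `k` and `threshold`.

```python
import numpy as np

def unit(v):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical pre-computed concept embeddings (toy values, not real data).
concepts = ["t2d_dx", "metformin", "hba1c_lab", "asthma_dx", "albuterol"]
emb = unit(np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.7, 0.3, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
]))
index = {c: i for i, c in enumerate(concepts)}

def build_phenotype(seed, k):
    """Return the seed concept plus its k nearest neighbors (cosine similarity)."""
    sims = emb @ emb[index[seed]]
    order = np.argsort(-sims)  # most similar first; the seed itself ranks first
    return [concepts[i] for i in order[: k + 1]]

def embed(concept_list):
    """Assumed aggregation: average the concept vectors, then normalize."""
    return unit(emb[[index[c] for c in concept_list]].mean(axis=0))

def link_patients(patients, phenotype, threshold):
    """Link a patient when their vector is close enough to the phenotype vector."""
    p_vec = embed(phenotype)
    return {pid for pid, hist in patients.items() if embed(hist) @ p_vec >= threshold}

phenotype = build_phenotype("t2d_dx", k=2)
patients = {
    "patient_a": ["t2d_dx", "metformin", "hba1c_lab"],
    "patient_b": ["asthma_dx", "albuterol"],
}
cohort = link_patients(patients, phenotype, threshold=0.8)
```

With these toy vectors, the phenotype for the seed `t2d_dx` consists of the seed and its two diabetes-related neighbors, and only `patient_a` falls within the similarity threshold. In practice the neighborhood size and threshold would need to be tuned and validated, for example against chart review as the abstract describes.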

Figure: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc06/8441576/9468bb1e7c22/gr1.jpg
