EHR-based phenotyping: Bulk learning and evaluation.

Author information

Chiu Po-Hsiang, Hripcsak George

Affiliation

Department of Biomedical Informatics, Columbia University, 622 W. 168th Street, New York, NY, USA.

Publication information

J Biomed Inform. 2017 Jun;70:35-51. doi: 10.1016/j.jbi.2017.04.009. Epub 2017 Apr 12.

DOI:10.1016/j.jbi.2017.04.009

PMID:28410982

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5934756/

Abstract

In data-driven phenotyping, a core computational task is to identify medical concepts and their variations from sources of electronic health records (EHR) to stratify phenotypic cohorts. A conventional analytic framework for phenotyping largely uses a manual knowledge engineering approach or a supervised learning approach where clinical cases are represented by variables encompassing diagnoses, medicinal treatments and laboratory tests, among others. In such a framework, tasks associated with feature engineering and data annotation remain a tedious and expensive exercise, resulting in poor scalability. In addition, certain clinical conditions, such as those that are rare and acute in nature, may never accumulate sufficient data over time, which poses a challenge to establishing accurate and informative statistical models. In this paper, we use infectious diseases as the domain of study to demonstrate a hierarchical learning method based on ensemble learning that attempts to address these issues through feature abstraction. We use a sparse annotation set to train and evaluate many phenotypes at once, which we call bulk learning. In this batch-phenotyping framework, disease cohort definitions can be learned from within the abstract feature space established by using multiple diseases as a substrate and diagnostic codes as surrogates. In particular, using surrogate labels for model training renders possible its subsequent evaluation using only a sparse annotated sample. Moreover, statistical models can be trained and evaluated, using the same sparse annotation, from within the abstract feature space of low dimensionality that encapsulates the shared clinical traits of these target diseases, collectively referred to as the bulk learning set.
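The two-level idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the synthetic cohort, the noise rates, the plain logistic regression, and the helper names (`make_cohort`, `fit_logistic`, `bulk_learn`) are all assumptions made for illustration. Level-0 models are trained per disease on surrogate labels (noisy diagnostic codes), their predicted probabilities form a low-dimensional abstract feature space, and level-1 models are trained and evaluated in that space using only a small annotated sample.

```python
import math
import random


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, x):
    # w holds one weight per feature followed by a bias term.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])


def fit_logistic(X, y, epochs=100, lr=0.5):
    """Logistic regression by batch gradient descent (bias stored last)."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            for j in range(d):
                grad[j] += err * xi[j]
            grad[d] += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def make_cohort(n_patients, n_diseases, feats_per_disease, rng):
    """Hypothetical stand-in for EHR data: one true disease per patient,
    one block of binary clinical features per disease."""
    X, truth, codes = [], [], []
    for _ in range(n_patients):
        disease = rng.randrange(n_diseases)
        x = []
        for k in range(n_diseases):
            p = 0.8 if k == disease else 0.1
            x.extend(1.0 if rng.random() < p else 0.0
                     for _ in range(feats_per_disease))
        # Surrogate label: a diagnostic code that is sometimes wrong
        # (assumed 15% chance of being replaced by a random code).
        code = disease if rng.random() > 0.15 else rng.randrange(n_diseases)
        X.append(x)
        truth.append(disease)
        codes.append(code)
    return X, truth, codes


def bulk_learn(X, codes, truth, annotated_idx, n_diseases):
    # Level 0: one base model per disease, trained on surrogate labels only.
    base = [fit_logistic(X, [float(c == k) for c in codes])
            for k in range(n_diseases)]
    # Abstract feature space: one base-model probability per disease.
    # (Proper stacking would use out-of-fold predictions; omitted for brevity.)
    Z = [[predict_proba(w, x) for w in base] for x in X]
    # Level 1: train on half of the sparse annotation, evaluate on the rest.
    half = len(annotated_idx) // 2
    train, test = annotated_idx[:half], annotated_idx[half:]
    accuracies = []
    for k in range(n_diseases):
        meta = fit_logistic([Z[i] for i in train],
                            [float(truth[i] == k) for i in train],
                            epochs=2000, lr=1.0)
        hits = sum((predict_proba(meta, Z[i]) >= 0.5) == (truth[i] == k)
                   for i in test)
        accuracies.append(hits / len(test))
    return Z, accuracies


if __name__ == "__main__":
    rng = random.Random(0)
    X, truth, codes = make_cohort(300, 3, 4, rng)
    annotated = rng.sample(range(300), 60)  # the only manually labelled cases
    Z, acc = bulk_learn(X, codes, truth, annotated, 3)
    print("abstract feature dimension:", len(Z[0]))
    print("held-out accuracy per disease:", acc)
```

In this sketch the true labels are consulted only for the 60 annotated patients, while the surrogate codes cover the whole cohort. That split mirrors the role the abstract assigns to sparse annotation.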
