Automated feature selection of predictors in electronic medical records data.

Author information

Gronsbell Jessica, Minnier Jessica, Yu Sheng, Liao Katherine, Cai Tianxi

Affiliations

Department of Biomedical Data Science, Stanford University, Stanford, California.

OHSU-PSU School of Public Health, Oregon Health & Science University, Portland, Oregon.

Publication information

Biometrics. 2019 Mar;75(1):268-277. doi: 10.1111/biom.12987. Epub 2019 Apr 2.

DOI:10.1111/biom.12987

PMID:30353541

Abstract

The use of Electronic Health Records (EHR) for translational research can be challenging due to difficulty in extracting accurate disease phenotype data. Historically, EHR algorithms for annotating phenotypes have been either rule-based or trained with billing codes and gold standard labels curated via labor intensive medical chart review. These simplistic algorithms tend to have unpredictable portability across institutions and low accuracy for many disease phenotypes due to imprecise billing codes. Recently, more sophisticated machine learning algorithms have been developed to improve the robustness and accuracy of EHR phenotyping algorithms. These algorithms are typically trained via supervised learning, relating gold standard labels to a wide range of candidate features including billing codes, procedure codes, medication prescriptions and relevant clinical concepts extracted from narrative notes via Natural Language Processing (NLP). However, due to the time intensiveness of gold standard labeling, the size of the training set is often insufficient to build a generalizable algorithm with the large number of candidate features extracted from EHR. To reduce the number of candidate predictors and in turn improve model performance, we present an automated feature selection method based entirely on unlabeled observations. The proposed method generates a comprehensive surrogate for the underlying phenotype with an unsupervised clustering of disease status based on several highly predictive features such as diagnosis codes and mentions of the disease in text fields available in the entire set of EHR data. A sparse regression model is then built with the estimated outcomes and remaining covariates to identify those features most informative of the phenotype of interest. Relying on the results of Li and Duan (1989), we demonstrate that variable selection for the underlying phenotype model can be achieved by fitting the surrogate-based model. 
We explore the performance of our methods in numerical simulations and present the results of a prediction model for Rheumatoid Arthritis (RA) built on a large EHR data mart from the Partners Health System consisting of billing codes and NLP terms. Empirical results suggest that our procedure reduces the number of gold-standard labels necessary for phenotyping thereby harnessing the automated power of EHR data and improving efficiency.
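The two-stage idea in the abstract (cluster a highly predictive surrogate without labels, then fit a sparse regression of the estimated outcome on the remaining features) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the simulated data, a one-dimensional two-means clustering standing in for the unsupervised clustering step, a plain lasso solved by coordinate descent standing in for the sparse regression, and the penalty value `lam=0.05`. The least-squares loss is a deliberately misspecified working model. The appeal to Li and Duan (1989) is that, under suitable design conditions, such a fit can still recover which features matter.

```python
import random

def two_means_1d(values, iters=100):
    """Split 1-D values into two clusters; 1.0 marks the higher-mean cluster.
    Assumes at least two distinct values."""
    lo, hi = min(values), max(values)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        low = [v for v in values if v <= mid]
        high = [v for v in values if v > mid]
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if new_lo == lo and new_hi == hi:
            break
        lo, hi = new_lo, new_hi
    mid = (lo + hi) / 2.0
    return [1.0 if v > mid else 0.0 for v in values]

def standardize(column):
    """Center a column and scale it to unit (population) variance."""
    n = len(column)
    mean = sum(column) / n
    sd = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / sd for v in column]

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso(columns, y, lam, sweeps=100, tol=1e-8):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1.
    Assumes standardized columns and a centered response."""
    n = len(y)
    beta = [0.0] * len(columns)
    resid = list(y)
    for _ in range(sweeps):
        max_change = 0.0
        for j, col in enumerate(columns):
            # Covariance of feature j with the partial residual.
            rho = sum(c * r for c, r in zip(col, resid)) / n + beta[j]
            new = soft_threshold(rho, lam)
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, col)]
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta

# Hypothetical simulated data: the true phenotype is used only to generate
# the surrogate and the features, never for fitting.
rng = random.Random(0)
n, p, informative = 2000, 20, 3
phenotype = [1.0 if rng.random() < 0.3 else 0.0 for _ in range(n)]
surrogate = [rng.gauss(3.0 * y, 1.0) for y in phenotype]
features = [
    standardize([rng.gauss(y if j < informative else 0.0, 1.0) for y in phenotype])
    for j in range(p)
]

# Stage 1: unsupervised estimate of disease status from the surrogate alone.
pseudo = two_means_1d(surrogate)
mean_pseudo = sum(pseudo) / n
centered = [v - mean_pseudo for v in pseudo]

# Stage 2: sparse regression of the estimated outcome on the other features.
beta = lasso(features, centered, lam=0.05)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected feature indices:", selected)
```

In this simulation only the first three features depend on the phenotype, so a reasonable penalty should keep them and drop most of the noise features. In practice the penalty would be tuned rather than fixed, and the selected features would then feed a supervised model trained on the small labeled set.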

Similar articles

Weakly Semi-supervised phenotyping using Electronic Health records.

J Biomed Inform. 2022 Oct;134:104175. doi: 10.1016/j.jbi.2022.104175. Epub 2022 Sep 5.

Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources.

J Am Med Inform Assoc. 2015 Sep;22(5):993-1000. doi: 10.1093/jamia/ocv034. Epub 2015 Apr 29.

Surrogate-assisted feature extraction for high-throughput phenotyping.

J Am Med Inform Assoc. 2017 Apr 1;24(e1):e143-e149. doi: 10.1093/jamia/ocw135.

Evaluating electronic health record data sources and algorithmic approaches to identify hypertensive individuals.

J Am Med Inform Assoc. 2017 Jan;24(1):162-171. doi: 10.1093/jamia/ocw071. Epub 2016 Aug 7.

Semi-supervised validation of multiple surrogate outcomes with application to electronic medical records phenotyping.

Biometrics. 2019 Mar;75(1):78-89. doi: 10.1111/biom.12971. Epub 2019 Mar 8.

Feature extraction for phenotyping from semantic and knowledge resources.

J Biomed Inform. 2019 Mar;91:103122. doi: 10.1016/j.jbi.2019.103122. Epub 2019 Feb 7.

Scalable relevance ranking algorithm via semantic similarity assessment improves efficiency of medical chart review.

J Biomed Inform. 2022 Aug;132:104109. doi: 10.1016/j.jbi.2022.104109. Epub 2022 Jun 1.

High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (PheCAP).

Nat Protoc. 2019 Dec;14(12):3426-3444. doi: 10.1038/s41596-019-0227-6. Epub 2019 Nov 20.

ARCH: Large-scale Knowledge Graph via Aggregated Narrative Codified Health Records Analysis.

medRxiv. 2023 May 21:2023.05.14.23289955. doi: 10.1101/2023.05.14.23289955.

Cited by

Label efficient phenotyping for Long COVID using electronic health records.

NPJ Digit Med. 2025 Jul 4;8(1):405. doi: 10.1038/s41746-025-01617-y.

Utilization of Computable Phenotypes in Electronic Health Record Research: A Review and Case Study in Atopic Dermatitis.

J Invest Dermatol. 2025 May;145(5):1008-1016. doi: 10.1016/j.jid.2024.08.025. Epub 2024 Nov 1.

Conceptualizing Patient as an Organization With the Adoption of Digital Health.

Biomed Eng Comput Biol. 2024 Sep 24;15:11795972241277292. doi: 10.1177/11795972241277292. eCollection 2024.

Surrogate Assisted Semi-supervised Inference for High Dimensional Risk Prediction.

J Mach Learn Res. 2023 Jan-Dec;24.

A data-driven approach to decode metabolic dysfunction-associated steatotic liver disease.

Ann Hepatol. 2024 Mar-Apr;29(2):101278. doi: 10.1016/j.aohep.2023.101278. Epub 2023 Dec 20.

Semi-supervised ROC analysis for reliable and streamlined evaluation of phenotyping algorithms.

J Am Med Inform Assoc. 2024 Feb 16;31(3):640-650. doi: 10.1093/jamia/ocad226.

Artificial Intelligence in Rheumatoid Arthritis: Current Status and Future Perspectives: A State-of-the-Art Review.

Rheumatol Ther. 2022 Oct;9(5):1249-1304. doi: 10.1007/s40744-022-00475-4. Epub 2022 Jul 18.

Development and Assessment of an Interpretable Machine Learning Triage Tool for Estimating Mortality After Emergency Admissions.

JAMA Netw Open. 2021 Aug 2;4(8):e2118467. doi: 10.1001/jamanetworkopen.2021.18467.

Comparative effectiveness of medical concept embedding for feature engineering in phenotyping.

JAMIA Open. 2021 Jun 16;4(2):ooab028. doi: 10.1093/jamiaopen/ooab028. eCollection 2021 Apr.

A machine learning method based on the genetic and world competitive contests algorithms for selecting genes or features in biological applications.

Sci Rep. 2021 Feb 8;11(1):3349. doi: 10.1038/s41598-021-82796-y.
