Automated feature selection of predictors in electronic medical records data.

Author information

Gronsbell Jessica, Minnier Jessica, Yu Sheng, Liao Katherine, Cai Tianxi

Affiliations

Department of Biomedical Data Science, Stanford University, Stanford, California.

OHSU-PSU School of Public Health, Oregon Health & Science University, Portland, Oregon.

Publication information

Biometrics. 2019 Mar;75(1):268-277. doi: 10.1111/biom.12987. Epub 2019 Apr 2.


DOI: 10.1111/biom.12987
PMID: 30353541
Abstract

The use of Electronic Health Records (EHR) for translational research can be challenging due to difficulty in extracting accurate disease phenotype data. Historically, EHR algorithms for annotating phenotypes have been either rule-based or trained with billing codes and gold standard labels curated via labor intensive medical chart review. These simplistic algorithms tend to have unpredictable portability across institutions and low accuracy for many disease phenotypes due to imprecise billing codes. Recently, more sophisticated machine learning algorithms have been developed to improve the robustness and accuracy of EHR phenotyping algorithms. These algorithms are typically trained via supervised learning, relating gold standard labels to a wide range of candidate features including billing codes, procedure codes, medication prescriptions and relevant clinical concepts extracted from narrative notes via Natural Language Processing (NLP). However, due to the time intensiveness of gold standard labeling, the size of the training set is often insufficient to build a generalizable algorithm with the large number of candidate features extracted from EHR. To reduce the number of candidate predictors and in turn improve model performance, we present an automated feature selection method based entirely on unlabeled observations. The proposed method generates a comprehensive surrogate for the underlying phenotype with an unsupervised clustering of disease status based on several highly predictive features such as diagnosis codes and mentions of the disease in text fields available in the entire set of EHR data. A sparse regression model is then built with the estimated outcomes and remaining covariates to identify those features most informative of the phenotype of interest. Relying on the results of Li and Duan (1989), we demonstrate that variable selection for the underlying phenotype model can be achieved by fitting the surrogate-based model. 
We explore the performance of our methods in numerical simulations and present the results of a prediction model for Rheumatoid Arthritis (RA) built on a large EHR data mart from the Partners Health System consisting of billing codes and NLP terms. Empirical results suggest that our procedure reduces the number of gold-standard labels necessary for phenotyping thereby harnessing the automated power of EHR data and improving efficiency.
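
The two-stage idea described in the abstract (cluster a highly predictive surrogate without labels, then fit a sparse regression of the estimated outcome on the remaining features) can be illustrated with a small simulation. This is only a rough sketch under stated assumptions, not the authors' implementation: the data are synthetic, the clustering step is assumed to be a two-component Gaussian mixture fitted by EM, the sparse model is assumed to be a linear lasso solved by coordinate descent, and the function names, the penalty `lam=0.05`, and all simulation settings are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Synthetic stand-in for EHR data (all values are made up) ---
n, p = 2000, 20
disease = rng.binomial(1, 0.3, size=n)  # latent phenotype; never used for fitting
# Hypothetical surrogate score (e.g. a standardized log count of diagnosis codes).
surrogate = rng.normal(loc=np.where(disease == 1, 3.0, 0.0), scale=0.7)
# Candidate features: the first three carry signal, the rest are pure noise.
X = rng.normal(size=(n, p))
X[:, :3] += disease[:, None]


def fit_two_component_mixture(s, n_iter=200):
    """EM for a 1-D two-component Gaussian mixture.

    Returns the posterior probability of belonging to the component
    with the larger mean (interpreted here as "likely case").
    """
    mu = np.array([np.percentile(s, 25), np.percentile(s, 75)], dtype=float)
    sigma = np.array([s.std(), s.std()]) + 1e-6
    weight = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities (the 1/sqrt(2*pi) constant cancels).
        dens = weight * np.exp(-0.5 * ((s[:, None] - mu) / sigma) ** 2) / sigma
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update means, standard deviations, and mixing weights.
        nk = resp.sum(axis=0)
        mu = (resp * s[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (s[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
        weight = nk / len(s)
    return resp[:, np.argmax(mu)]


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - Xb||^2 + lam*||b||_1 on standardized columns."""
    n_obs, n_feat = X.shape
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # each column has mean 0, variance 1
    resid = y - y.mean()
    beta = np.zeros(n_feat)
    for _ in range(n_iter):
        for j in range(n_feat):
            # Correlation of feature j with the partial residual.
            rho = Xs[:, j] @ resid / n_obs + beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0)  # soft threshold
            resid += Xs[:, j] * (beta[j] - new)
            beta[j] = new
    return beta


# Stage 1: unsupervised estimate of disease status from the surrogate alone.
prob_case = fit_two_component_mixture(surrogate)
# Stage 2: sparse regression of the estimated outcome on the candidate features.
beta = lasso_coordinate_descent(X, prob_case, lam=0.05)
selected = np.flatnonzero(beta)
print("selected feature indices:", selected.tolist())
```

No gold-standard label enters either stage; `disease` is generated only so the simulation has a ground truth to compare against. In practice the penalty would need to be tuned rather than fixed, and several surrogates could be clustered jointly instead of the single score assumed here.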


Similar articles

[1]
Automated feature selection of predictors in electronic medical records data.

Biometrics. 2019-3

[2]
Weakly Semi-supervised phenotyping using Electronic Health records.

J Biomed Inform. 2022-10

[3]
Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources.

J Am Med Inform Assoc. 2015-9

[4]
Surrogate-assisted feature extraction for high-throughput phenotyping.

J Am Med Inform Assoc. 2017-4-1

[5]
Evaluating electronic health record data sources and algorithmic approaches to identify hypertensive individuals.

J Am Med Inform Assoc. 2017-1

[6]
Semi-supervised validation of multiple surrogate outcomes with application to electronic medical records phenotyping.

Biometrics. 2019-3

[7]
Feature extraction for phenotyping from semantic and knowledge resources.

J Biomed Inform. 2019-2-7

[8]
Scalable relevance ranking algorithm via semantic similarity assessment improves efficiency of medical chart review.

J Biomed Inform. 2022-8

[9]
High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (PheCAP).

Nat Protoc. 2019-11-20

[10]
ARCH: Large-scale Knowledge Graph via Aggregated Narrative Codified Health Records Analysis.

medRxiv. 2023-5-21

Cited by

[1]
Label efficient phenotyping for Long COVID using electronic health records.

NPJ Digit Med. 2025-7-4

[2]
Utilization of Computable Phenotypes in Electronic Health Record Research: A Review and Case Study in Atopic Dermatitis.

J Invest Dermatol. 2025-5

[3]
Conceptualizing Patient as an Organization With the Adoption of Digital Health.

Biomed Eng Comput Biol. 2024-9-24

[4]
Surrogate Assisted Semi-supervised Inference for High Dimensional Risk Prediction.

J Mach Learn Res. 2023

[5]
A data-driven approach to decode metabolic dysfunction-associated steatotic liver disease.

Ann Hepatol. 2024

[6]
Semi-supervised ROC analysis for reliable and streamlined evaluation of phenotyping algorithms.

J Am Med Inform Assoc. 2024-2-16

[7]
Artificial Intelligence in Rheumatoid Arthritis: Current Status and Future Perspectives: A State-of-the-Art Review.

Rheumatol Ther. 2022-10

[8]
Development and Assessment of an Interpretable Machine Learning Triage Tool for Estimating Mortality After Emergency Admissions.

JAMA Netw Open. 2021-8-2

[9]
Comparative effectiveness of medical concept embedding for feature engineering in phenotyping.

JAMIA Open. 2021-6-16

[10]
A machine learning method based on the genetic and world competitive contests algorithms for selecting genes or features in biological applications.

Sci Rep. 2021-2-8
