Automated feature selection of predictors in electronic medical records data.

Authors

Gronsbell Jessica, Minnier Jessica, Yu Sheng, Liao Katherine, Cai Tianxi

Affiliations

Department of Biomedical Data Science, Stanford University, Stanford, California.

OHSU-PSU School of Public Health, Oregon Health & Science University, Portland, Oregon.

Publication information

Biometrics. 2019 Mar;75(1):268-277. doi: 10.1111/biom.12987. Epub 2019 Apr 2.

Abstract

The use of Electronic Health Records (EHR) for translational research can be challenging due to difficulty in extracting accurate disease phenotype data. Historically, EHR algorithms for annotating phenotypes have been either rule-based or trained with billing codes and gold-standard labels curated via labor-intensive medical chart review. These simplistic algorithms tend to have unpredictable portability across institutions and low accuracy for many disease phenotypes due to imprecise billing codes. Recently, more sophisticated machine learning algorithms have been developed to improve the robustness and accuracy of EHR phenotyping algorithms. These algorithms are typically trained via supervised learning, relating gold-standard labels to a wide range of candidate features including billing codes, procedure codes, medication prescriptions and relevant clinical concepts extracted from narrative notes via Natural Language Processing (NLP). However, due to the time intensiveness of gold-standard labeling, the size of the training set is often insufficient to build a generalizable algorithm with the large number of candidate features extracted from EHR. To reduce the number of candidate predictors and in turn improve model performance, we present an automated feature selection method based entirely on unlabeled observations. The proposed method generates a comprehensive surrogate for the underlying phenotype with an unsupervised clustering of disease status based on several highly predictive features such as diagnosis codes and mentions of the disease in text fields available in the entire set of EHR data. A sparse regression model is then built with the estimated outcomes and remaining covariates to identify those features most informative of the phenotype of interest. Relying on the results of Li and Duan (1989), we demonstrate that variable selection for the underlying phenotype model can be achieved by fitting the surrogate-based model.
We explore the performance of our methods in numerical simulations and present the results of a prediction model for Rheumatoid Arthritis (RA) built on a large EHR data mart from the Partners Health System consisting of billing codes and NLP terms. Empirical results suggest that our procedure reduces the number of gold-standard labels necessary for phenotyping thereby harnessing the automated power of EHR data and improving efficiency.
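
The two-step idea in the abstract (cluster on a few strong surrogate features without labels, then fit a sparse regression of the estimated outcome on the remaining features) can be illustrated with a minimal sketch. This is not the authors' implementation. The simulated data, the single combined surrogate score, the one-dimensional two-component Gaussian mixture, the plain lasso, the penalty value 0.05, and the function names `mixture_posterior` and `lasso_cd` are all assumptions chosen for illustration.

```python
import numpy as np

def mixture_posterior(score, n_iter=200):
    """EM for a 1-D two-component Gaussian mixture (illustrative assumption).
    Returns the posterior probability of the higher-mean component."""
    mu = np.array([np.percentile(score, 25), np.percentile(score, 75)], dtype=float)
    sd = np.full(2, score.std() + 1e-6)
    weight = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each observation
        dens = weight * np.exp(-0.5 * ((score[:, None] - mu) / sd) ** 2) / sd
        resp = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
        # M-step: update mixing weights, means, and standard deviations
        nk = resp.sum(axis=0)
        weight = nk / len(score)
        mu = (resp * score[:, None]).sum(axis=0) / nk
        sd = np.sqrt((resp * (score[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
    return resp[:, np.argmax(mu)]

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso by cyclic coordinate descent: minimizes
    (1/2n)||y - mean(y) - X b||^2 + lam * ||b||_1 for centered columns of X."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y - y.mean()
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]            # remove feature j's contribution
            rho = X[:, j] @ resid / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]            # add the updated contribution back
    return beta

# Simulated data (hypothetical): the true phenotype is never used for fitting.
rng = np.random.default_rng(0)
n, p = 2000, 20
phenotype = rng.binomial(1, 0.3, n)
icd_count = rng.poisson(1 + 6 * phenotype)        # surrogate 1: diagnosis code count
nlp_count = rng.poisson(1 + 5 * phenotype)        # surrogate 2: NLP mention count
X = rng.normal(size=(n, p))
X[:, :3] += 1.5 * phenotype[:, None]              # only features 0..2 are informative

# Step 1: unsupervised estimate of disease status from the surrogates.
surrogate_score = np.log1p(icd_count) + np.log1p(nlp_count)
prob = mixture_posterior(surrogate_score)

# Step 2: sparse regression of the estimated outcome on the remaining features.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
beta = lasso_cd(Xs, prob, lam=0.05)
selected = np.flatnonzero(beta != 0)
print("selected feature indices:", selected)
```

In this sketch the selected features would then be carried into a supervised model trained on the small gold-standard labeled set; that final step is omitted here.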

