School of Statistics, Renmin University of China, 59 Zhong Guan Cun Ave, Hai Dian District, Beijing, People's Republic of China.
Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, 851 S Morgan St, Chicago, IL, 60607, USA.
BMC Med Res Methodol. 2020 Feb 26;20(1):37. doi: 10.1186/s12874-020-00923-1.
The main goal of this study is to explore the use of features that represent patient-level electronic health record (EHR) data, generated by an autoencoder (an unsupervised deep learning algorithm), in predictive modeling. Because autoencoder features are learned without reference to any response variable, this paper focuses on their use as a general lower-dimensional representation of EHR information across a wide variety of predictive tasks.
We compare the model with autoencoder features to two traditional models: a logistic model with the least absolute shrinkage and selection operator (LASSO) and the Random Forest algorithm. In addition, we include a predictive model that uses a small subset of response-specific variables (Simple Reg) and a model that combines these variables with features from the autoencoder (Enhanced Reg). We performed the study first on simulated data that mimic real-world EHR data and then on actual EHR data from eight Advocate hospitals.
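The construction of the Enhanced Reg inputs can be illustrated with a minimal sketch. The abstract does not state the autoencoder architecture, so everything here is an assumption made for illustration: a single hidden layer with a tanh encoder and linear decoder, plain gradient descent on mean squared error, synthetic data standing in for EHR variables, and the hypothetical names `train_autoencoder`, `ehr`, `specific`, and `enhanced`. It is not the authors' implementation.

```python
import numpy as np

def train_autoencoder(X, n_hidden=3, lr=0.1, epochs=500, seed=0):
    """Fit a one-hidden-layer autoencoder by full-batch gradient descent.

    Assumed toy architecture: tanh encoder, linear decoder, MSE loss.
    Returns an encoding function and the per-epoch reconstruction loss.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_hidden, d))
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)        # encoded (lower-dimensional) features
        err = (H @ W2 + b2) - X         # reconstruction error
        losses.append(float(np.mean(err ** 2)))
        g_out = 2.0 * err / err.size    # gradient of MSE w.r.t. reconstruction
        g_hidden = (g_out @ W2.T) * (1.0 - H ** 2)  # backprop through tanh
        W2 -= lr * (H.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (X.T @ g_hidden)
        b1 -= lr * g_hidden.sum(axis=0)

    def encode(A):
        return np.tanh(A @ W1 + b1)

    return encode, losses

# Synthetic stand-in for standardized EHR variables (not real data).
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 2))
ehr = latent @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(200, 8))
ehr = (ehr - ehr.mean(axis=0)) / ehr.std(axis=0)

# Hypothetical response-specific variables: here simply the first two columns.
specific = ehr[:, :2]

encode, losses = train_autoencoder(ehr, n_hidden=3)

# "Enhanced Reg"-style design matrix: response-specific variables
# concatenated with the unsupervised autoencoder features.
enhanced = np.hstack([specific, encode(ehr)])
```

Because the encoder is trained without any response variable, the same `encode(ehr)` features could be reused for different prediction targets; only the small set of response-specific columns would change from task to task.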
On simulated data with incorrect categories and missing data, the precision of the autoencoder model is 24.16% when recall is fixed at 0.7, which is higher than that of Random Forest (23.61%) and lower than that of LASSO (25.32%). The precision is 20.92% for Simple Reg and improves to 24.89% for Enhanced Reg. When real EHR data are used to predict the 30-day readmission rate, the precision of the autoencoder model is 19.04%, which again is higher than that of Random Forest (18.48%) and lower than that of LASSO (19.70%). The precisions for Simple Reg and Enhanced Reg are 18.70% and 19.69%, respectively. That is, Enhanced Reg achieves prediction performance competitive with LASSO. In addition, the results show that Enhanced Reg usually relies on fewer features in the simulation settings of this paper.
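The metric reported above, precision at a fixed recall of 0.7, can be computed from predicted risk scores as in the following sketch. The function name `precision_at_recall` and the toy labels and scores are hypothetical, and the abstract does not say how the authors handled tied scores or interpolation; this version simply flags patients in descending score order until the target recall is first reached.

```python
import numpy as np

def precision_at_recall(y_true, scores, target_recall=0.7):
    """Precision at the smallest top-k cutoff whose recall reaches target_recall.

    Assumption: cases are ranked by descending score and ties are broken
    by input order; no interpolation between cutoffs is performed.
    """
    if not 0.0 < target_recall <= 1.0:
        raise ValueError("target_recall must be in (0, 1]")
    y_true = np.asarray(y_true, dtype=int)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")   # highest risk first
    hits = np.cumsum(y_true[order])              # true positives in the top k
    n_pos = hits[-1]
    if n_pos == 0:
        raise ValueError("y_true contains no positive cases")
    recall = hits / n_pos
    precision = hits / np.arange(1, len(y_true) + 1)
    # recall is non-decreasing, so searchsorted finds the first cutoff
    # at which recall >= target_recall.
    k = int(np.searchsorted(recall, target_recall, side="left"))
    return float(precision[k])

# Toy example: 4 positive cases among 10 (not data from the study).
labels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
risk = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20]
p = precision_at_recall(labels, risk, target_recall=0.7)
```

Fixing recall puts all models on the same footing: each is required to capture the same share of true readmissions, and the comparison is then on how many flagged patients are false alarms.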
We conclude that an autoencoder can create useful features that represent the entire space of EHR data and are applicable to a wide array of predictive tasks. Combined with important response-specific predictors, these features yield efficient and robust predictive models with less labor in data extraction and model training.