Sharafoddini Anis, Dubin Joel A, Maslove David M, Lee Joon
Health Data Science Lab, School of Public Health and Health Systems, University of Waterloo, Waterloo, ON, Canada.
Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, ON, Canada.
JMIR Med Inform. 2019 Jan 8;7(1):e11605. doi: 10.2196/11605.
The data missing from patient profiles in intensive care units (ICUs) are substantial and unavoidable. However, this incompleteness is not always random, nor is it always the result of imperfections in the data collection process.
This study aimed to investigate the potential hidden information in data missing from electronic health records (EHRs) in an ICU and examine whether the presence or missingness of a variable itself can convey information about the patient health status.
Daily retrieval of laboratory test (LT) measurements from the Medical Information Mart for Intensive Care III database was set as our reference for defining complete patient profiles. Missingness indicators were introduced to represent the presence or absence of each LT in a patient profile. Thereafter, filter and embedded feature selection methods were used to examine the predictive power of the missingness indicators. Finally, a set of well-known prediction models (logistic regression [LR], decision tree, and random forest) was used to evaluate whether the absence of a recorded variable can itself provide predictive power. We also examined the utility of missingness indicators in improving predictive performance when used together with observed laboratory measurements as model input. The outcomes of interest were in-hospital mortality and mortality at 30 days after ICU discharge.
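As an illustration of the missingness indicators described above, the following is a minimal sketch, not the study's actual code. It assumes a table with one row per patient-day and one column per laboratory test; the column names, the toy values, the `_missing` suffix, and the coding (1 = missing, 0 = observed) are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def add_missingness_indicators(labs: pd.DataFrame) -> pd.DataFrame:
    """Append one 0/1 column per lab test: 1 if the value is missing, 0 if observed."""
    indicators = labs.isna().astype(int).add_suffix("_missing")
    return pd.concat([labs, indicators], axis=1)


# Hypothetical toy data: one row per patient-day, one column per lab test.
labs = pd.DataFrame(
    {
        "lactate": [2.1, np.nan, np.nan, 4.0],
        "creatinine": [1.0, 0.9, np.nan, 1.4],
    }
)

profile = add_missingness_indicators(labs)
print(profile)
```

The indicator columns can then be passed to a feature selection method or a classifier either on their own or alongside the (imputed) measurements.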
Regardless of mortality type or ICU day, more than 40% of the predictors selected by feature selection methods were missingness indicators. Notably, employing missingness indicators as the only predictors achieved reasonable mortality prediction on all days and for all mortality types (for instance, in 30-day mortality prediction with LR, we achieved an area under the receiver operating characteristic curve [AUROC] of 0.6836±0.012). Including indicators with observed measurements in the prediction models also improved the AUROC; the maximum improvement was 0.0426. Indicators also improved the AUROC for the Simplified Acute Physiology Score II model (a well-known ICU severity-of-illness score), confirming the additive information of the indicators (AUROC of 0.8045±0.0109 for 30-day mortality prediction with LR).
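The comparison of models with and without indicators can be sketched as follows. This is an illustrative setup on synthetic data only: the data-generating process (a latent severity that drives both mortality and whether a test is ordered), the mean imputation, the single train/test split, and all variable names are assumptions for the example and do not reproduce the study's pipeline or its reported AUROC values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic cohort (assumption): sicker patients die more often and are
# also more likely to have the lab test ordered, so missingness is informative.
severity = rng.normal(size=n)
died = (rng.random(n) < 1 / (1 + np.exp(-severity))).astype(int)
missing = (rng.random(n) < 1 / (1 + np.exp(severity))).astype(int)
value = np.where(missing == 1, np.nan, severity + rng.normal(size=n))

v_tr, v_te, m_tr, m_te, y_tr, y_te = train_test_split(
    value, missing, died, test_size=0.3, random_state=0, stratify=died
)

# Mean imputation using the training split only.
train_mean = np.nanmean(v_tr)
v_tr = np.where(np.isnan(v_tr), train_mean, v_tr)
v_te = np.where(np.isnan(v_te), train_mean, v_te)


def auroc(x_train: np.ndarray, x_test: np.ndarray) -> float:
    """Fit logistic regression and return the AUROC on the held-out split."""
    model = LogisticRegression().fit(x_train, y_tr)
    return roc_auc_score(y_te, model.predict_proba(x_test)[:, 1])


auc_values = auroc(v_tr.reshape(-1, 1), v_te.reshape(-1, 1))
auc_indicator = auroc(m_tr.reshape(-1, 1), m_te.reshape(-1, 1))
auc_both = auroc(np.column_stack([v_tr, m_tr]), np.column_stack([v_te, m_te]))

print(f"values only:    {auc_values:.4f}")
print(f"indicator only: {auc_indicator:.4f}")
print(f"values + indicator: {auc_both:.4f}")
```

In practice the comparison would be repeated over cross-validation folds or resamples so that a mean and a spread can be reported for each AUROC.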
Our study demonstrated that the presence or absence of LT measurements is informative and can be considered a potential predictor of in-hospital and 30-day mortality. The comparative analysis of prediction models also showed statistically significant prediction improvement when indicators were included. Moreover, missing data might reflect the opinions of examining clinicians. Therefore, the absence of measurements can be informative in ICUs and has predictive power beyond the measured data themselves. This initial case study shows promise for more in-depth analysis of missing data and their informativeness in ICUs. Future studies are needed to generalize these results.