IEEE J Biomed Health Inform. 2019 May;23(3):1243-1250. doi: 10.1109/JBHI.2018.2883606. Epub 2019 Apr 16.
The diversity and number of parameters monitored in an intensive care unit (ICU) make the resulting databases highly susceptible to quality issues, such as missing information and erroneous data entry, which adversely affect downstream processing and predictive modeling. Missing data interpolation and imputation techniques, such as multiple imputation, expectation maximization, and hot-deck imputation, do not account for the type of missing data, which can lead to bias. In our study, we first model the missing data as three types: "neglectable," a.k.a. "missing completely at random"; "recoverable," a.k.a. "missing at random"; and "not easily recoverable," a.k.a. "missing not at random." We then design imputation techniques for each type of missing data. We use a publicly available database (MIMIC II) to demonstrate how these imputations perform with random forests for prediction. Our results indicate that these novel imputation techniques outperformed standard mean filling and expectation maximization in predicting ICU mortality, with statistical significance (p ≤ 0.01).
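The claim that ignoring the type of missing data can lead to bias can be illustrated with a small simulation. The sketch below is not the imputation method of the study. It is a toy example under stated assumptions: a synthetic, always-observed covariate `age`, a synthetic measurement `lab` that rises with age, and arbitrarily chosen keep probabilities for each of the three mechanisms. All names and numbers are hypothetical. It compares the naive mean of the observed values with the true mean, and for the "recoverable" case it also shows one simple correction, which is averaging within age strata.

```python
import random
import statistics


def simulate(n=20000, seed=0):
    """Return (age, lab) pairs; age is always observed, lab may go missing."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.gauss(60, 15)
        lab = 1.0 + 0.02 * (age - 60) + rng.gauss(0, 0.5)  # lab rises with age
        rows.append((age, lab))
    return rows


def mask(rows, keep_prob, seed):
    """Set lab to None with probability 1 - keep_prob(age, lab)."""
    rng = random.Random(seed)
    return [(age, lab if rng.random() < keep_prob(age, lab) else None)
            for age, lab in rows]


def observed_mean(rows):
    """Mean of the lab values that were not masked (what mean filling would use)."""
    return statistics.fmean([lab for _, lab in rows if lab is not None])


def age_stratified_mean(rows, cutoff=60):
    """Weight each age stratum's observed mean by that stratum's full size."""
    total = 0.0
    for in_stratum in (lambda a: a > cutoff, lambda a: a <= cutoff):
        stratum = [(a, lab) for a, lab in rows if in_stratum(a)]
        total += len(stratum) / len(rows) * observed_mean(stratum)
    return total


rows = simulate()
true_mean = statistics.fmean([lab for _, lab in rows])

# Missing completely at random: missingness ignores both age and lab.
mcar = mask(rows, lambda age, lab: 0.7, seed=1)
# Missing at random: missingness depends only on the observed covariate (age).
mar = mask(rows, lambda age, lab: 0.9 if age > 60 else 0.5, seed=2)
# Missing not at random: missingness depends on the unobserved value itself.
mnar = mask(rows, lambda age, lab: 0.9 if lab > 1.0 else 0.5, seed=3)

results = {
    "true": true_mean,
    "mcar_observed": observed_mean(mcar),
    "mar_observed": observed_mean(mar),
    "mar_stratified": age_stratified_mean(mar),
    "mnar_observed": observed_mean(mnar),
    "mnar_stratified": age_stratified_mean(mnar),
}
for name, value in results.items():
    print(f"{name:16s} {value:.3f}")
```

In this toy setting the observed mean stays close to the true mean only when values are missing completely at random. Conditioning on age repairs the missing-at-random case but not the missing-not-at-random case, which is consistent with treating the three types differently.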