Center for Health Informatics and Technology, The Maersk Mc-Kinney Institute, University of Southern Denmark, Odense, Denmark.
Department of Mathematics and Computer Science (IMADA), University of Southern Denmark, Odense, Denmark.
BMC Med Inform Decis Mak. 2021 Oct 30;21(1):298. doi: 10.1186/s12911-021-01660-1.
Predicting length of stay (LOS) at admission time can give physicians and nurses insight into the illness severity of patients and help them avoid adverse events and clinical deterioration. It also helps hospitals manage their resources and manpower more effectively.
Research in this area faces several important challenges, notably missing values and skewed LOS data. Moreover, many studies use binary classification, which places a wide range of patients with different conditions into a single category. To address these shortcomings, multivariate imputation techniques are first applied to fill incomplete records; then two resampling techniques, Borderline-SMOTE and SMOGN, are applied to address data skewness in the classification and regression domains, respectively. Finally, machine learning (ML) techniques, including neural networks, extreme gradient boosting, random forest, support vector machine, and decision tree, are implemented for both approaches to predict the LOS of patients admitted to the Emergency Department of Odense University Hospital between June 2018 and April 2019. The ML models are developed from data obtained from patients at admission time, including pulse rate, arterial blood oxygen saturation, respiratory rate, systolic blood pressure, triage category, arrival ICD-10 codes, age, and gender.
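To make the resampling step more concrete, the following is a minimal pure-Python sketch of the core idea behind Borderline-SMOTE: minority-class samples whose neighbourhood is dominated, but not entirely occupied, by another class are treated as borderline, and synthetic samples are interpolated between them and nearby minority samples. It is an illustration only, not the implementation used in the study; the function names, the parameters `m`, `k`, and `n_new`, and the toy data are assumptions introduced here.

```python
import math
import random


def nearest(sample, pool, k):
    """Return the k samples in `pool` closest to `sample` (Euclidean distance)."""
    others = [s for s in pool if s is not sample]
    return sorted(others, key=lambda s: math.dist(sample[0], s[0]))[:k]


def borderline_smote(X, y, minority_label, m=5, k=3, n_new=1, seed=0):
    """Generate synthetic minority samples near the class border.

    m, k, and n_new are illustrative parameters (assumptions of this sketch):
    m = neighbours used to decide whether a sample is borderline,
    k = minority neighbours used for interpolation,
    n_new = synthetic samples created per borderline sample.
    """
    rng = random.Random(seed)
    samples = list(zip(X, y))
    minority = [s for s in samples if s[1] == minority_label]

    # Step 1: a minority sample is "in danger" when at least half, but not all,
    # of its m nearest neighbours belong to another class.
    danger = []
    for s in minority:
        n_other = sum(1 for nb in nearest(s, samples, m) if nb[1] != minority_label)
        if m / 2 <= n_other < m:
            danger.append(s)

    # Step 2: interpolate between each borderline sample and a randomly chosen
    # minority neighbour to create new points on the connecting segment.
    synthetic = []
    for s in danger:
        neighbours = nearest(s, minority, k)
        if not neighbours:
            continue
        for _ in range(n_new):
            nb = rng.choice(neighbours)
            gap = rng.random()
            synthetic.append([a + gap * (b - a) for a, b in zip(s[0], nb[0])])
    return synthetic


# Toy data (hypothetical): label 0 = short stay, label 1 = long stay.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (1, 2),
     (2, 2), (3, 3), (3, 4), (4, 3), (4, 4)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
new_points = borderline_smote(X, y, minority_label=1, m=5, k=3, n_new=2)
print(new_points)
```

In this toy example only the minority sample at (2, 2) sits close enough to the other class to count as borderline, so all synthetic points are generated around it. In practice a maintained implementation, such as the `BorderlineSMOTE` class in the imbalanced-learn package, would normally be preferred over a hand-written version.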
The performance of the predictive models before and after addressing missing values and data skewness is evaluated using four evaluation metrics, namely receiver operating characteristic, area under the curve (AUC), R-squared score (R²), and normalized root mean square error (NRMSE). Results show that, after addressing these challenges, model performance improves on average by 15.75% for AUC, 32.19% for R² score, and 11.32% for NRMSE. Moreover, our results indicate a relationship between the missing values rate, data skewness, and the illness severity of patients, so it is clinically essential to take incomplete patient records into account and to apply proper methods for imputing missing values.
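For readers less familiar with these metrics, the sketch below computes AUC, the R-squared score, and NRMSE in plain Python. It is illustrative rather than the evaluation code used in the study. In particular, the abstract does not state how the RMSE is normalized; dividing by the range of the observed values is an assumption made here, and the function names are hypothetical.

```python
import math


def auc(y_true, scores):
    """AUC as the probability that a random positive case receives a higher
    score than a random negative case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def nrmse(y_true, y_pred):
    """RMSE divided by the range of the observed values (assumed normalization)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse) / (max(y_true) - min(y_true))


# Small hypothetical examples.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(r_squared([1, 2, 3], [2, 2, 2]))           # 0.0
print(nrmse([0, 10], [1, 9]))
```

Higher values are better for AUC and R-squared, whereas lower values are better for NRMSE.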
We propose a new method comprising three stages: missing values imputation, data skewness handling, and building predictive models based on classification and regression approaches. Our results indicate that properly addressing these challenges significantly enhances model performance, leading to a more valid prediction of LOS.