Xu Zhenhui, Zhao Congwen, Scales Charles D, Henao Ricardo, Goldstein Benjamin A
Department of Biostatistics and Bioinformatics, Duke University, 2424 Erwin Road, Suite 1104, Durham, NC, 27705, USA.
Duke Clinical Research Institute, Duke University, Durham, NC, USA.
BMC Med Inform Decis Mak. 2022 Apr 24;22(1):110. doi: 10.1186/s12911-022-01855-0.
In the early stages of the COVID-19 pandemic, our institution was interested in forecasting how long surgical patients receiving elective procedures would spend in the hospital. Initial examination of our models indicated that, because length of stay is heavily skewed, accurate prediction was challenging, and we instead opted for a simpler classification model. In this work we perform a deeper examination of predicting in-hospital length of stay.
We used electronic health record data on length of stay from 42,209 elective surgeries. We compared different loss functions (mean squared error, mean absolute error, mean relative error), algorithms (LASSO, random forests, multilayer perceptron) and data transformations (log and truncation). We also assessed the performance of a two-stage hybrid classification-regression approach.
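As an illustration of why the choice of loss function and transformation matters for a skewed outcome, the sketch below computes the three losses on a small made-up sample. The data, the constant prediction, the truncation cap, and the definition of mean relative error as |y - ŷ| / y are all assumptions for illustration; the abstract does not state the exact formulas or settings used.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average squared difference; dominated by a few very long stays."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference, in days."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_relative_error(y_true, y_pred):
    """Average of |y - y_hat| / y (assumed definition; requires y > 0)."""
    return sum(abs(t - p) / t for t, p in zip(y_true, y_pred)) / len(y_true)

def log_transform(y):
    """Natural-log transform to compress the long right tail (requires y > 0)."""
    return [math.log(v) for v in y]

def truncate(y, cap):
    """Cap every value at `cap` days (hypothetical cutoff)."""
    return [min(v, cap) for v in y]

# Hypothetical lengths of stay in days: four short stays and one long one.
los = [1, 2, 2, 3, 30]
# A constant prediction of 3 days for every patient.
pred = [3, 3, 3, 3, 3]

print("MSE:", mean_squared_error(los, pred))
print("MAE:", mean_absolute_error(los, pred))
print("MRE:", mean_relative_error(los, pred))
print("Truncated at 7 days:", truncate(los, 7))
print("Log-transformed:", [round(v, 2) for v in log_transform(los)])
```

In this toy sample the single 30-day stay contributes almost all of the squared error, which is the kind of sensitivity to the tail that motivates comparing alternative losses and transformations.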
Our results show that while short lengths of stay can be predicted accurately, predicting longer lengths of stay is extremely challenging. As such, we opt for a two-stage model: the first stage classifies patients into long versus short length of stay, and the second stage fits a regressor among those predicted to have a short length of stay.
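The two-stage idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the single feature, the 7-day cutoff, the threshold classifier, and the one-variable least-squares regressor are hypothetical stand-ins for the classifiers and regressors compared in the study.

```python
from statistics import mean

LONG_STAY_DAYS = 7  # hypothetical cutoff separating long from short stays

def fit_threshold(xs, is_long):
    """Stage 1 stand-in: choose the feature cutoff with the fewest misclassifications."""
    best_cut, best_err = None, None
    for cut in sorted(set(xs)):
        err = sum((x >= cut) != flag for x, flag in zip(xs, is_long))
        if best_err is None or err < best_err:
            best_cut, best_err = cut, err
    return best_cut

def fit_line(xs, ys):
    """Stage 2 stand-in: one-variable ordinary least squares, returns (intercept, slope)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - slope * mx, slope

def fit_two_stage(xs, los):
    """Fit the classifier on everyone, then the regressor on predicted-short patients only."""
    is_long = [days > LONG_STAY_DAYS for days in los]
    cut = fit_threshold(xs, is_long)
    short = [(x, days) for x, days in zip(xs, los) if x < cut]
    # Assumes at least one patient is predicted short; mean() raises otherwise.
    intercept, slope = fit_line([x for x, _ in short], [d for _, d in short])
    return cut, intercept, slope

def predict(model, x):
    """Return ('long', None) when no point estimate is offered, else ('short', days)."""
    cut, intercept, slope = model
    if x >= cut:
        return ("long", None)
    return ("short", intercept + slope * x)

# Hypothetical risk scores and observed lengths of stay (days).
scores = [1, 2, 3, 4, 5, 9, 10]
stays = [1, 2, 3, 4, 5, 20, 45]

model = fit_two_stage(scores, stays)
print(predict(model, 3))
print(predict(model, 12))
```

Returning no numeric estimate for patients classified as long stays makes explicit where the model declines to give a point prediction.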
The results indicate both the challenges of, and the considerations necessary when, applying machine-learning methods to skewed outcomes.
Two-stage models allow those developing clinical decision support tools to explicitly acknowledge where they can and cannot make accurate predictions.