Department of Medical Informatics, Erasmus University Medical Center, Doctor Molewaterplein 40, 3015 GD, Rotterdam, the Netherlands.
Institute for Medical Engineering & Science, Massachusetts Institute of Technology, Cambridge, MA, USA.
BMC Med Res Methodol. 2023 Dec 7;23(1):285. doi: 10.1186/s12874-023-02112-2.
Deep learning models have been highly successful in many fields, but they have struggled on structured data. Here we apply four state-of-the-art supervised deep learning models that use the attention mechanism and compare them against logistic regression and XGBoost in terms of discrimination, calibration, and clinical utility.
We develop the models using a general practitioner database. We implement a recurrent neural network, a transformer with and without reverse distillation, and a graph neural network. We measure discrimination using the area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (AUPRC). We assess smooth calibration using restricted cubic splines and clinical utility using decision curve analysis.
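To make the discrimination and clinical utility measures named above concrete, the following is a minimal sketch in plain Python. It is illustrative only and is not the code used in the study: the function names, the toy labels and scores, and the step-wise (average precision) estimate of AUPRC are all assumptions, and ties between scores are not handled specially in the AUPRC estimate. Net benefit is the quantity plotted against the threshold in a decision curve, here taken as TP/n - (FP/n) * t / (1 - t) for threshold t.

```python
def auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Step-wise AUPRC estimate: mean of the precision at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    true_pos = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_pos += 1
            total += true_pos / rank
    if true_pos == 0:
        raise ValueError("at least one positive label is required")
    return total / true_pos


def net_benefit(y_true, probs, threshold):
    """Net benefit of treating everyone whose predicted risk is >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


# Toy data (hypothetical): four patients, two of whom had the outcome.
labels = [0, 0, 1, 1]
risks = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, risks), average_precision(labels, risks))
print([round(net_benefit(labels, risks, t), 3) for t in (0.3, 0.5)])
```

A decision curve is obtained by evaluating net_benefit over a range of thresholds for each model and comparing the results with the treat-all and treat-none strategies.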
Our results show that deep learning approaches can improve discrimination by up to 2.5 percentage points in AUC and 7.4 percentage points in AUPRC. On average, however, the baselines are competitive. Most models are calibrated similarly to the baselines, except for the graph neural network. The transformer using reverse distillation shows the best clinical utility on two out of three prediction problems over most of the prediction thresholds.
In this study, we evaluated various supervised learning approaches using neural networks and attention. We performed a rigorous comparison that considers not only discrimination but also calibration and clinical utility. There is value in using deep learning models on electronic health record data, since they can improve discrimination and clinical utility while providing good calibration. However, good baseline methods are still competitive.