Botnar Research Centre, Centre for Statistics in Medicine, Nuffield Department of Orthopaedics Rheumatology and Musculoskeletal Sciences (NDORMS), University of Oxford, Oxford, UK.
Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands.
Comput Methods Programs Biomed. 2021 Nov;211:106394. doi: 10.1016/j.cmpb.2021.106394. Epub 2021 Sep 6.
In response to the ongoing COVID-19 pandemic, several prediction models reported in the literature were rapidly developed with the aim of providing evidence-based guidance. However, none of these COVID-19 prediction models have been found to be reliable. Models are commonly assessed to have a risk of bias, often due to insufficient reporting, use of non-representative data, and lack of large-scale external validation. In this paper, we present the Observational Health Data Sciences and Informatics (OHDSI) analytics pipeline for patient-level prediction modeling as a standardized approach for rapid yet reliable development and validation of prediction models. We demonstrate how our analytics pipeline and open-source software tools can be used to answer important prediction questions while limiting potential causes of bias (e.g., by validating phenotypes, specifying the target population, performing large-scale external validation, and publicly providing all analytical source code).
We show step by step how to implement the analytics pipeline for the question: 'In patients hospitalized with COVID-19, what is the risk of death 0 to 30 days after hospitalization?' We develop models using six different machine learning methods in a USA claims database containing over 20,000 COVID-19 hospitalizations and externally validate the models using data containing over 45,000 COVID-19 hospitalizations from South Korea, Spain, and the USA.
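The prediction question above combines a target population (COVID-19 hospitalizations), an outcome (death), and a time-at-risk window (0 to 30 days after hospitalization). The following Python sketch illustrates how such an outcome label could be derived. It is not the OHDSI implementation: the `Hospitalization` record, the `died_in_window` helper, and the choice to treat both ends of the window as inclusive are assumptions made for illustration, and the three records are made-up examples.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Hospitalization:
    """Hypothetical record for one member of the target population."""
    patient_id: int
    admission_date: date         # index date: start of the COVID-19 hospitalization
    death_date: Optional[date]   # None if no death is recorded


def died_in_window(h: Hospitalization, start_day: int = 0, end_day: int = 30) -> bool:
    """Return True if death falls within [start_day, end_day] days after admission.

    Assumption: both ends of the time-at-risk window are inclusive.
    """
    if h.death_date is None:
        return False
    window_start = h.admission_date + timedelta(days=start_day)
    window_end = h.admission_date + timedelta(days=end_day)
    return window_start <= h.death_date <= window_end


# Made-up example records, not data from the study.
cohort = [
    Hospitalization(1, date(2020, 4, 1), date(2020, 4, 10)),  # death on day 9
    Hospitalization(2, date(2020, 4, 5), None),               # no recorded death
    Hospitalization(3, date(2020, 5, 1), date(2020, 6, 15)),  # death on day 45
]

# Outcome label per patient: True only when death occurs inside the window.
labels = {h.patient_id: died_in_window(h) for h in cohort}
```

Keeping the window bounds as parameters makes the time-at-risk an explicit part of the problem specification, so the same labeling logic can be reused when the question changes.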
Our open-source software tools enabled us to efficiently go end-to-end from problem design to reliable model development and evaluation. When predicting death in patients hospitalized with COVID-19, AdaBoost, random forest, gradient boosting machine, and decision tree yielded similar or lower internal and external validation discrimination performance compared to L1-regularized logistic regression, whereas the MLP neural network consistently resulted in lower discrimination. L1-regularized logistic regression models were well calibrated.
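Discrimination and calibration, the two aspects of performance compared above, can be illustrated with a small sketch. The Python code below computes the area under the ROC curve (AUC) as a measure of discrimination and an observed-to-expected ratio as a simple measure of mean calibration. The function names `auc` and `observed_to_expected` and the toy data are assumptions for illustration; this is not the evaluation code used in the study, and the abstract does not state which specific metrics were reported.

```python
from typing import Sequence


def auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the probability that a random positive case is scored above a
    random negative case; tied scores count as half a win."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def observed_to_expected(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Ratio of observed outcomes to the sum of predicted risks.

    A value near 1.0 suggests good mean calibration; above 1.0 means the
    model underestimates risk on average, below 1.0 means it overestimates.
    """
    expected = sum(y_score)
    if expected == 0:
        raise ValueError("Sum of predicted risks is zero")
    return sum(y_true) / expected


# Toy data (hypothetical): 1 = died within the time-at-risk window, 0 = survived.
outcomes = [0, 0, 1, 1]
predicted_risk = [0.1, 0.4, 0.35, 0.8]

discrimination = auc(outcomes, predicted_risk)
mean_calibration = observed_to_expected(outcomes, predicted_risk)
```

Applying the same two functions to the development data and to each external database gives directly comparable internal and external validation figures.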
Our results show that following the OHDSI analytics pipeline for patient-level prediction modeling can enable the rapid development of reliable prediction models. The OHDSI software tools and pipeline are open source and available to researchers around the world.