Reps Jenna M, Rijnbeek Peter R, Ryan Patrick B
Johnson & Johnson, Raritan, NJ, USA.
Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands.
Diagn Progn Res. 2025 Apr 17;9(1):10. doi: 10.1186/s41512-025-00191-x.
Large observational healthcare databases are frequently used to develop models to be implemented in real-world clinical practice populations. For example, these databases were used to develop COVID severity models that guided interventions such as whom to prioritize for vaccination during the pandemic. However, the clinical setting and observational databases often differ in the types of patients (case mix), and identifying patients with medical conditions (phenotyping) in these databases is a nontrivial process. In this study, we investigate how sensitive a model's performance is to the choice of development database, population, and outcome phenotype.
We developed > 450 different logistic regression models for nine prediction tasks across seven databases with a range of suitable population and outcome phenotypes. Performance stability within tasks was calculated by applying each model to data created by permuting the database, population, or outcome phenotype. We investigated both performance stability (AUROC, scaled Brier score, and calibration-in-the-large) and individual risk estimate stability.
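The three performance metrics named above have standard definitions that can be sketched briefly. The code below is illustrative only, assuming the common formulations: AUROC as the Mann-Whitney probability that a random case outranks a random non-case, scaled Brier as 1 minus the Brier score divided by the Brier score of a null model predicting the outcome prevalence, and calibration-in-the-large as the gap between observed event rate and mean predicted risk. The function names and the exact variants the study used are assumptions, not taken from the paper.

```python
def auroc(y, p):
    # Mann-Whitney U formulation: probability a random positive
    # outranks a random negative (ties count as 0.5).
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def scaled_brier(y, p):
    # 1 - Brier / Brier_null, where the null model always predicts
    # the outcome prevalence; 1 is perfect, 0 is no better than null.
    brier = sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)
    prev = sum(y) / len(y)
    return 1 - brier / (prev * (1 - prev))

def calibration_in_the_large(y, p):
    # Observed event rate minus mean predicted risk; 0 means the
    # model is calibrated on average in this population.
    return sum(y) / len(y) - sum(p) / len(p)

y = [1, 0, 1, 0]
p = [0.8, 0.2, 0.6, 0.4]
print(auroc(y, p))                    # 1.0: every case outranks every non-case
print(scaled_brier(y, p))             # 0.6
print(calibration_in_the_large(y, p)) # 0.0: mean risk matches event rate
```

Note that a model can keep a high AUROC while its calibration-in-the-large drifts badly in a new database, which is exactly the instability pattern the study reports.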
In general, changing the outcome definition or population phenotype had little impact on validation discrimination. However, validation discrimination was unstable when the database changed. Calibration and Brier performance were unstable when the population, outcome definition, or database changed. This may be problematic if a model developed using observational data is implemented in a real-world setting.
These results highlight the importance of validating a model developed using observational data in the clinical setting prior to using it for decision-making. Calibration and the Brier score should be evaluated to prevent miscalibrated risk estimates from being used to aid clinical decisions.