Department of Neurology, Erasmus MC University Medical Center Rotterdam, 40 Doctor Molewaterplein, P.O. Box 2040, Rotterdam, Zuid-Holland, 3015 GD, The Netherlands.
Department of Public Health, Erasmus MC University Medical Center Rotterdam, Rotterdam, Zuid-Holland, The Netherlands.
BMC Med Res Methodol. 2024 Aug 8;24(1):176. doi: 10.1186/s12874-024-02280-9.
Prediction models are often externally validated with data from a single study or cohort. However, the interpretation of performance estimates obtained with single-study external validation is not as straightforward as assumed. We aimed to illustrate this by conducting a large number of external validations of a prediction model for functional outcome in subarachnoid hemorrhage (SAH) patients.
We used data from the Subarachnoid Hemorrhage International Trialists (SAHIT) data repository (n = 11,931, 14 studies) to refit the SAHIT model for predicting a dichotomous functional outcome (favorable versus unfavorable), assessed with the (extended) Glasgow Outcome Scale or modified Rankin Scale score at a minimum of three months after discharge. We performed leave-one-cluster-out cross-validation to mimic the process of multiple single-study external validations; each study represented one cluster. In each of these validations, we assessed discrimination with Harrell's c-statistic and calibration with calibration plots, calibration intercepts, and calibration slopes. We used random effects meta-analysis to obtain the (reference) mean performance estimates and between-study heterogeneity (I² statistic). The influence of case-mix variation on discriminative performance was assessed with the model-based c-statistic, and we fitted a "membership model" to obtain a gross estimate of the model's transportability.
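A minimal sketch of the leave-one-cluster-out procedure is shown below, assuming a pandas DataFrame `df` with a study identifier, a binary outcome column, and a list of predictor columns (all column and variable names are hypothetical and do not correspond to the actual SAHIT predictors); discrimination is summarized with the c-statistic and calibration with the intercept and slope in each held-out study.

```python
# Minimal sketch of leave-one-cluster-out cross-validation for a logistic
# prediction model. Assumes a pandas DataFrame `df` with a binary outcome
# column, a study identifier, and predictor columns; all names here are
# hypothetical and do not match the actual SAHIT variables.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

def leave_one_study_out(df, predictors, outcome="unfavorable", study="study"):
    results = []
    for held_out in df[study].unique():
        train = df[df[study] != held_out]
        test = df[df[study] == held_out]

        # Refit the model on the remaining studies (development data).
        model = sm.Logit(train[outcome],
                         sm.add_constant(train[predictors])).fit(disp=0)

        # Validate on the held-out study (mimics one single-study validation).
        X_test = sm.add_constant(test[predictors], has_constant="add")
        lp = X_test.dot(model.params).to_numpy()      # linear predictor
        p = 1.0 / (1.0 + np.exp(-lp))                 # predicted probability
        y = test[outcome].to_numpy()

        # Discrimination: c-statistic (equals the ROC AUC for a binary outcome).
        c_stat = roc_auc_score(y, p)

        # Calibration intercept: intercept-only model with the linear
        # predictor as offset (calibration-in-the-large).
        cal_int = sm.Logit(y, np.ones(len(y)), offset=lp).fit(disp=0).params[0]

        # Calibration slope: regress the outcome on the linear predictor.
        cal_slope = sm.Logit(y, sm.add_constant(lp)).fit(disp=0).params[1]

        results.append({"study": held_out, "c_statistic": c_stat,
                        "intercept": cal_int, "slope": cal_slope})
    return pd.DataFrame(results)
```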
Across 14 single-study external validations, model performance was highly variable. The mean c-statistic was 0.74 (95% CI 0.70-0.78, range 0.52-0.84, I² = 0.92), the mean calibration intercept was -0.06 (95% CI -0.37 to 0.24, range -1.40 to 0.75, I² = 0.97), and the mean calibration slope was 0.96 (95% CI 0.78-1.13, range 0.53-1.31, I² = 0.90). The decrease in discriminative performance was attributable to case-mix variation, between-study heterogeneity, or a combination of both. In some validations, we observed poor generalizability or transportability of the model.
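The pooled estimates and I² values above come from random effects meta-analysis of the per-study results. A minimal sketch of one common pooling approach is given below; the DerSimonian-Laird estimator used here is an assumption for illustration, since the abstract does not specify the estimator.

```python
# Minimal sketch of DerSimonian-Laird random-effects pooling of per-study
# performance estimates. `theta` holds per-study estimates (e.g., calibration
# slopes) and `se` their standard errors; both are hypothetical inputs.
import numpy as np

def random_effects_pool(theta, se):
    theta, se = np.asarray(theta, float), np.asarray(se, float)
    w = 1.0 / se**2                                     # fixed-effect weights
    theta_fe = np.sum(w * theta) / np.sum(w)
    q = np.sum(w * (theta - theta_fe) ** 2)             # Cochran's Q
    k = len(theta)
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                  # between-study variance
    w_re = 1.0 / (se**2 + tau2)                         # random-effects weights
    pooled = np.sum(w_re * theta) / np.sum(w_re)
    pooled_se = np.sqrt(1.0 / np.sum(w_re))
    i2 = max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0  # I² heterogeneity
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, ci, tau2, i2
```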
We demonstrate two potential pitfalls in the interpretation of model performance with single-study external validation. With single-study external validation, (1) model performance is highly variable and depends on the choice of validation data, and (2) no insight is provided into the generalizability or transportability of the model, which is needed to guide local implementation. As such, a single-study external validation can easily be misinterpreted and lead to a false appreciation of the clinical prediction model. Cross-validation is better equipped to address these pitfalls.