Owkin France, Paris, France.
BMC Med Res Methodol. 2022 Dec 28;22(1):335. doi: 10.1186/s12874-022-01799-z.
An external control arm is a cohort of control patients drawn from data external to a single-arm trial. To provide an unbiased estimate of efficacy, the clinical profiles of patients in the single arm and the external arm should be aligned, typically using propensity score approaches. Alternative approaches infer efficacy by comparing the outcomes of single-arm patients with machine-learning predictions of control patient outcomes. These methods include G-computation and Doubly Debiased Machine Learning (DDML), but they have not been sufficiently evaluated for External Control Arm (ECA) analysis.
We use both numerical simulations and a trial replication procedure to evaluate four statistical approaches: propensity score matching, Inverse Probability of Treatment Weighting (IPTW), G-computation, and DDML. The replication study relies on five type 2 diabetes randomized clinical trials, with data access granted by the Yale University Open Data Access (YODA) project. From this pool of five trials, observational experiments are artificially constructed by replacing the control arm of one trial with an arm from another trial that contains similarly treated patients.
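To make the contrast between outcome-model and propensity-score estimators concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the target is the average effect in the single-arm population (the abstract does not state the estimand), uses a plain linear outcome model as a stand-in for a machine-learning predictor, and fits the propensity score with a hand-written logistic regression. All function names, the simulated data, and the effect size are hypothetical.

```python
import numpy as np

def fit_logistic(X, a, n_iter=25):
    """Logistic regression by Newton's method; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (a - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        beta += np.linalg.solve(hess, grad)
    return beta

def g_computation_att(X, a, y):
    """Fit an outcome model on external controls (a == 0), predict control
    outcomes for single-arm patients (a == 1), and compare with observed outcomes."""
    coef, *_ = np.linalg.lstsq(X[a == 0], y[a == 0], rcond=None)
    predicted_control = X[a == 1] @ coef
    return y[a == 1].mean() - predicted_control.mean()

def iptw_att(X, a, y):
    """Reweight external controls by the odds e(x) / (1 - e(x)) so that their
    covariate distribution resembles that of the single-arm patients."""
    beta = fit_logistic(X, a)
    e = 1.0 / (1.0 + np.exp(-X @ beta))
    w = e[a == 0] / (1.0 - e[a == 0])
    return y[a == 1].mean() - np.average(y[a == 0], weights=w)

def simulate(n=5000, effect=1.0, seed=0):
    """Hypothetical data: covariate x0 drives both arm membership and outcome."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, 2))
    X = np.column_stack([np.ones(n), x])
    p_single_arm = 1.0 / (1.0 + np.exp(-x[:, 0]))
    a = rng.binomial(1, p_single_arm)
    y = 1.0 + 2.0 * x[:, 0] + 0.5 * x[:, 1] + effect * a + rng.normal(size=n)
    return X, a, y

X, a, y = simulate()
print("naive difference:", y[a == 1].mean() - y[a == 0].mean())
print("G-computation:   ", g_computation_att(X, a, y))
print("IPTW:            ", iptw_att(X, a, y))
```

In this simulated setting the naive difference in means is confounded by x0, while both adjusted estimators should land near the simulated effect. DDML is not shown; it combines an outcome model and a propensity model, typically with cross-fitting.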
In numerical simulations, DDML has the smallest bias among the statistical approaches, followed by G-computation. G-computation usually achieves the lowest mean squared error. Relative to the other methods, the mean squared error of DDML varies and improves with increasing sample size. For hypothesis testing, all methods control type I error, and DDML is the most conservative. G-computation is the best method in terms of statistical power; DDML has comparable power at [Formula: see text] but lower power at smaller sample sizes. The replication procedure also indicates that G-computation minimizes mean squared error, whereas the performance of DDML is intermediate between G-computation and the propensity score approaches. The confidence intervals of G-computation are the narrowest, whereas those obtained with DDML are the widest for small sample sizes, which confirms its conservative nature.
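The metrics reported above (bias, mean squared error, type I error, power, confidence interval coverage) can be computed from replicated estimates roughly as in the sketch below. It assumes normal-approximation 95% confidence intervals and a null hypothesis of zero effect. The function name and the synthetic replicate estimates are hypothetical and are not results from the study.

```python
import numpy as np

def summarize(estimates, std_errors, truth, z=1.96):
    """Summarize replicated effect estimates against a known true effect."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    return {
        "bias": estimates.mean() - truth,
        "variance": estimates.var(),  # population variance (ddof=0)
        "mse": np.mean((estimates - truth) ** 2),  # equals bias**2 + variance
        "coverage": np.mean((lower <= truth) & (truth <= upper)),
        # Share of replicates whose interval excludes zero: this is the
        # type I error when truth == 0 and the power otherwise.
        "rejection_rate": np.mean((lower > 0.0) | (upper < 0.0)),
        "mean_ci_width": np.mean(upper - lower),
    }

# Synthetic replicates for illustration only.
rng = np.random.default_rng(1)
n_rep, se = 20000, 0.1
under_null = summarize(rng.normal(0.0, se, n_rep), np.full(n_rep, se), truth=0.0)
under_alt = summarize(rng.normal(0.3, se, n_rep), np.full(n_rep, se), truth=0.3)
print("type I error:", under_null["rejection_rate"])
print("power:       ", under_alt["rejection_rate"])
```

A conservative method in this sense would show a rejection rate below the nominal level under the null, together with wider intervals.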
For external control arm analyses, methods based on outcome prediction models can reduce estimation error and increase statistical power compared to propensity score approaches.