Thangaraj Phyllis M, Shankar Sumukh Vasisht, Oikonomou Evangelos K, Khera Rohan
Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, New Haven, CT, USA.
Section of Health Informatics, Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA.
medRxiv. 2023 Dec 15:2023.12.06.23299464. doi: 10.1101/2023.12.06.23299464.
Randomized clinical trials (RCTs) are designed to produce evidence in selected populations. Assessing their effects in the real world is essential to change medical practice; however, key populations have historically been underrepresented in RCTs. We define an approach to simulate RCT-based effects in real-world settings using RCT digital twins that reflect the covariate patterns in an electronic health record (EHR).
We developed a generative adversarial network (GAN) model, RCT-Twin-GAN, which generates a digital twin of an RCT (RCT-Twin) conditioned on covariate distributions from an EHR cohort. We improved upon a traditional tabular conditional GAN, CTGAN, with a loss function adapted for data distributions and by conditioning on multiple discrete and continuous covariates simultaneously. We assessed the similarity between a heart failure with preserved ejection fraction (HFpEF) RCT (TOPCAT), a Yale HFpEF EHR cohort, and RCT-Twin. We also evaluated cardiovascular event-free survival stratified by spironolactone (treatment) use.
By applying RCT-Twin-GAN to 3445 TOPCAT participants and conditioning on 3445 Yale EHR HFpEF patients, we generated RCT-Twin datasets ranging from 1141 to 3445 patients in size, depending on covariate conditioning and model parameters. RCT-Twin randomly allocated the spironolactone (S) and placebo (P) arms as an RCT does, was similar to the RCT by a multi-dimensional distance metric, and balanced covariates (median absolute standardized mean difference (MASMD) 0.017, IQR 0.0034-0.030). The 5 EHR-conditioned covariates in RCT-Twin were closer to the EHR than those in the RCT were (MASMD 0.008, IQR 0.005-0.018, vs MASMD 0.63, IQR 0.59-1.11). RCT-Twin reproduced the overall effect size seen in TOPCAT: the odds ratio (95% confidence interval) for the 5-year cardiovascular composite outcome was 0.89 (0.75-1.06) in the RCT vs 0.85 (0.69-1.04) in RCT-Twin.
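As an illustration of the balance metric reported above, the following is a minimal sketch of how a median absolute standardized mean difference (MASMD) with its IQR could be computed across covariates. The abstract does not state the exact formula used; the pooled-standard-deviation form of the standardized mean difference, the function names `abs_smd` and `masmd`, and the dictionary-of-arrays input layout are all assumptions made for this example.

```python
import numpy as np


def abs_smd(a, b):
    """Absolute standardized mean difference between two samples.

    Assumption: the difference in means is divided by the square root of
    the average of the two sample variances (a common convention; the
    source does not specify which variant was used).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)
    if pooled_sd == 0.0:
        # Both samples are constant: identical means give 0, else undefined.
        return 0.0 if a.mean() == b.mean() else float("inf")
    return abs(a.mean() - b.mean()) / pooled_sd


def masmd(group_a, group_b):
    """Median and IQR of per-covariate absolute SMDs.

    group_a and group_b are hypothetical dicts mapping a covariate name
    to a 1-D sequence of values for that group (e.g. the two trial arms,
    or the synthetic cohort and the EHR cohort).
    """
    smds = [abs_smd(group_a[name], group_b[name]) for name in group_a]
    q1, median, q3 = np.percentile(smds, [25, 50, 75])
    return float(median), (float(q1), float(q3))
```

Under these assumptions, calling `masmd(arm_s, arm_p)` on the two treatment arms summarizes covariate balance, while calling it on a synthetic cohort and an EHR cohort, restricted to the conditioned covariates, summarizes how closely the two populations match.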
RCT-Twin-GAN simulates RCT-derived effects in real-world patients by translating these effects to the covariate distributions of EHR patients. This key methodological advance may enable the direct translation of RCT-derived effects into real-world patient populations and may support causal inference in real-world settings.