Azizi Zahra, Zheng Chaoyi, Mosquera Lucy, Pilote Louise, El Emam Khaled
Center for Outcomes Research and Evaluation, Faculty of Medicine, McGill University, Montreal, Québec, Canada.
Data Science, Replica Analytics Ltd, Ottawa, Ontario, Canada.
BMJ Open. 2021 Apr 16;11(4):e043497. doi: 10.1136/bmjopen-2020-043497.
There are increasing requirements to make research data, especially clinical trial data, more broadly available for secondary analyses. However, data availability remains a challenge due to complex privacy requirements. This challenge can potentially be addressed using synthetic data.
Replication of a published stage III colon cancer trial secondary analysis using synthetic data generated by a machine learning method.
The analysis included 1543 patients from the control arm.
Analyses from a study published on the real dataset were replicated on synthetic data to investigate the relationship between bowel obstruction and event-free survival. Information theoretic metrics were used to compare the univariate distributions between real and synthetic data. Percentage CI overlap was used to assess the similarity in the size of the bivariate relationships, and similarly for the multivariate Cox models derived from the two datasets.
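The abstract does not name the information theoretic metric used for the univariate comparison. As a minimal sketch, assuming the Hellinger distance is an acceptable stand-in, the comparison for one categorical variable could look like this. The function name `hellinger_distance` and the sample values are hypothetical and do not come from the study.

```python
from collections import Counter
from math import sqrt


def hellinger_distance(real_values, synthetic_values):
    """Hellinger distance between two empirical categorical distributions.

    Returns a value in [0, 1]: 0 means identical distributions,
    1 means the two samples share no categories.
    Assumption: the study's metric is not specified in the abstract;
    Hellinger distance is used here purely for illustration.
    """
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    n_real = sum(real_counts.values())
    n_synth = sum(synth_counts.values())

    # Compare relative frequencies over the union of observed categories.
    categories = set(real_counts) | set(synth_counts)
    squared_diff = 0.0
    for category in categories:
        p = real_counts[category] / n_real
        q = synth_counts[category] / n_synth
        squared_diff += (sqrt(p) - sqrt(q)) ** 2
    return sqrt(0.5 * squared_diff)


# Hypothetical example values for a binary variable.
real = ["yes", "no", "no", "no"]
synthetic = ["yes", "no", "no", "yes"]
distance = hellinger_distance(real, synthetic)
```

A distance close to 0 for every variable would correspond to the kind of close univariate agreement the abstract reports.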
Analysis results were similar between the real and synthetic datasets. The univariate distributions differed by less than 1% on an information theoretic metric. All of the bivariate relationships had a CI overlap on the tau statistic above 50%. The main conclusion from the published study, that lack of bowel obstruction has a strong impact on survival, was replicated directionally. The HR CI overlap between the real and synthetic data was 61% for overall survival (real data: HR 1.56, 95% CI 1.11 to 2.2; synthetic data: HR 2.03, 95% CI 1.44 to 2.87) and 86% for disease-free survival (real data: HR 1.51, 95% CI 1.18 to 1.95; synthetic data: HR 1.63, 95% CI 1.26 to 2.1).
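The abstract does not give the formula for percentage CI overlap. One common definition, assumed here, averages the share of each interval that is covered by their intersection. The sketch below applies that definition on the HR scale to the intervals reported above. The function name `ci_overlap` is hypothetical, and clipping disjoint intervals to zero is a choice made for this example.

```python
def ci_overlap(real_ci, synthetic_ci):
    """Average fraction of each confidence interval covered by their intersection.

    Each argument is a (lower, upper) tuple. Returns a value in [0, 1].
    Assumption: this is one common definition of CI overlap; the abstract
    does not state the exact formula used in the study.
    """
    real_lower, real_upper = real_ci
    synth_lower, synth_upper = synthetic_ci

    # Length of the shared region; clipped at 0 when the intervals are disjoint.
    intersection = max(0.0, min(real_upper, synth_upper) - max(real_lower, synth_lower))

    real_share = intersection / (real_upper - real_lower)
    synth_share = intersection / (synth_upper - synth_lower)
    return 0.5 * (real_share + synth_share)


# Intervals for the hazard ratios as reported in the abstract.
overall_survival = ci_overlap((1.11, 2.2), (1.44, 2.87))
disease_free_survival = ci_overlap((1.18, 1.95), (1.26, 2.1))

print(f"Overall survival overlap: {overall_survival:.0%}")
print(f"Disease-free survival overlap: {disease_free_survival:.0%}")
```

Under this assumed definition the two values round to 61% and 86%, which agrees with the reported figures.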
The high concordance between the analytical results and conclusions from synthetic and real data suggests that synthetic data can be used as a reasonable proxy for real clinical trial datasets.
Trial registration: NCT00079274.