Department of Health Sciences, Centre for Medicine, University of Leicester, University Road, Leicester, LE1 7RH, UK.
Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden.
BMC Med Res Methodol. 2022 Jun 23;22(1):176. doi: 10.1186/s12874-022-01654-1.
The lack of data and statistical code published alongside journal articles is a significant barrier to open scientific discourse and to the reproducibility of research. Information governance restrictions inhibit the dissemination of individual-level data to accompany published manuscripts. Realistic, high-fidelity synthetic time-to-event data can accelerate methodological developments in survival analysis and beyond by enabling researchers to access and test published methods using data similar to those on which the methods were developed.
We present methods to accurately emulate the covariate patterns and survival times found in real-world datasets using synthetic data techniques, without compromising patient privacy. We model the joint covariate distribution of the original data using covariate-specific sequential conditional regression models, then fit a complex flexible parametric survival model from which to generate survival times conditional on individual covariate patterns. We recreate the administrative censoring mechanism using the last observed follow-up date from the original dataset. We also present metrics for evaluating the accuracy of the synthetic data and the non-identifiability of individuals from the original dataset.
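The three steps described above (sequential covariate models, survival times drawn conditional on covariates, administrative censoring) can be illustrated with a minimal sketch. It is not the authors' implementation. For brevity, a Weibull proportional hazards model with assumed (not fitted) parameters stands in for the flexible parametric survival model, the "original" data are simulated toy values, and all names (`fit_sequential`, `simulate_weibull_ph`, `END_OF_FOLLOWUP`, the coefficients) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2022)

# Toy stand-in for the original data (assumption: two covariates only).
n_orig = 500
orig_sex = rng.binomial(1, 0.45, size=n_orig)                 # 1 = female
orig_age = 70 + 2 * orig_sex + rng.normal(0, 10, size=n_orig)
orig_entry = rng.uniform(0, 10, size=n_orig)                  # years since study start
END_OF_FOLLOWUP = 10.0                                        # common last follow-up date


def fit_sequential(sex, age):
    """Fit sex marginally, then age conditional on sex (linear model)."""
    p_female = sex.mean()
    X = np.column_stack([np.ones_like(age), sex])
    beta, *_ = np.linalg.lstsq(X, age, rcond=None)
    sigma = np.std(age - X @ beta, ddof=2)                    # residual SD
    return p_female, beta, sigma


def simulate_covariates(n, p_female, beta, sigma, rng):
    """Draw covariates in the same order the models were fitted."""
    sex = rng.binomial(1, p_female, size=n)
    age = beta[0] + beta[1] * sex + rng.normal(0, sigma, size=n)
    return sex, age


def simulate_weibull_ph(lin_pred, lam, gamma, rng):
    """Inverse-transform sampling from S(t|x) = exp(-lam * t**gamma * exp(lp))."""
    u = 1.0 - rng.uniform(size=lin_pred.shape[0])             # u in (0, 1]
    return (-np.log(u) / (lam * np.exp(lin_pred))) ** (1.0 / gamma)


def apply_admin_censoring(t, entry, end_of_followup):
    """Censor each subject at the time remaining until the last follow-up date."""
    c = end_of_followup - entry
    return np.minimum(t, c), (t <= c).astype(int)


# 1. Covariates: fit on the original data, then simulate new individuals.
p_female, beta, sigma = fit_sequential(orig_sex, orig_age)
n_syn = 500
syn_sex, syn_age = simulate_covariates(n_syn, p_female, beta, sigma, rng)

# 2. Survival times conditional on covariates (hypothetical log hazard ratios).
lin_pred = 0.03 * (syn_age - 70) - 0.10 * syn_sex
latent_t = simulate_weibull_ph(lin_pred, lam=0.1, gamma=0.9, rng=rng)

# 3. Administrative censoring, resampling entry times from the original data.
syn_entry = rng.choice(orig_entry, size=n_syn, replace=True)
syn_time, syn_event = apply_admin_censoring(latent_t, syn_entry, END_OF_FOLLOWUP)
```

In this sketch, no synthetic covariate value or survival time is copied from an original record. The exception is the entry times, which are resampled directly from the original data and would themselves need to be modelled or perturbed before release.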
We successfully create a synthetic version of an example colon cancer dataset of 9064 patients, designed to closely resemble the original data in both covariate distributions and survival times without containing any exact information from the original data, therefore allowing it to be published openly alongside research.
We evaluate the effectiveness of the methods for constructing synthetic data, and provide evidence that there is minimal risk that a given patient from the original data could be identified from their unique individual information. Synthetic datasets built with this methodology could be made available alongside published research without breaching data privacy protocols, allowing data and code to accompany methodological or applied manuscripts and greatly improving the transparency and accessibility of medical research.
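One simple way to probe identifiability is to count how many synthetic records coincide exactly with an original record. The sketch below illustrates that idea only and is not necessarily one of the metrics used in the paper. The function name `exact_match_rate`, the rounding precision, and the toy arrays (columns: age, sex, survival time) are assumptions.

```python
import numpy as np


def exact_match_rate(original, synthetic, decimals=2):
    """Proportion of synthetic rows identical to some original row after rounding."""
    orig_rows = {tuple(r) for r in np.round(original, decimals).tolist()}
    syn_rows = [tuple(r) for r in np.round(synthetic, decimals).tolist()]
    return sum(r in orig_rows for r in syn_rows) / len(syn_rows)


# Toy records: age (years), sex (0/1), survival time (years).
original = np.array([[63.0, 1, 2.51], [71.0, 0, 0.84], [58.0, 1, 5.00]])
synthetic = np.array([[63.0, 1, 2.51], [70.2, 0, 1.10],
                      [59.4, 1, 4.37], [66.8, 0, 3.02]])

rate = exact_match_rate(original, synthetic)  # one of four rows matches: 0.25
```

A rate near zero is reassuring but not sufficient on its own, since near matches on rare covariate patterns can still carry disclosure risk.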