Generating high-fidelity synthetic time-to-event datasets to improve data transparency and accessibility.

Affiliations

Department of Health Sciences, Centre for Medicine, University of Leicester, University Road, Leicester, LE1 7RH, UK.

Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden.

Publication information

BMC Med Res Methodol. 2022 Jun 23;22(1):176. doi: 10.1186/s12874-022-01654-1.

DOI:10.1186/s12874-022-01654-1
PMID:35739465
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9229142/
Abstract

BACKGROUND

The lack of data and statistical code published alongside journal articles is a significant barrier to open scientific discourse and to the reproducibility of research. Information governance restrictions inhibit the dissemination of individual-level data to accompany published manuscripts. Realistic, high-fidelity synthetic time-to-event data can accelerate methodological developments in survival analysis and beyond by enabling researchers to access and test published methods using data similar to that on which they were developed.

METHODS

We present methods to accurately emulate the covariate patterns and survival times found in real-world datasets using synthetic data techniques, without compromising patient privacy. We model the joint covariate distribution of the original data using covariate-specific sequential conditional regression models, then fit a complex flexible parametric survival model from which to generate survival times conditional on individual covariate patterns. We recreate the administrative censoring mechanism using the last observed follow-up date from the original dataset. We also present metrics for evaluating the accuracy of the synthetic data and the non-identifiability of individuals from the original dataset.
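The three steps described above (sequential conditional covariate models, survival times drawn conditional on covariates, administrative censoring at a last follow-up date) can be illustrated with a minimal sketch. It is not the authors' implementation: a Weibull proportional hazards model stands in for the flexible parametric survival model, and every coefficient, covariate name (`sex`, `age`, `advanced`), the uniform entry-time distribution, and the 10-year follow-up limit are hypothetical values chosen for illustration only. In practice the coefficients would come from models fitted to the original data.

```python
import math
import random


def draw_covariates(rng):
    """Draw one covariate pattern, each variable conditional on those before it."""
    sex = 1 if rng.random() < 0.5 else 0                 # marginal model for sex
    age = rng.gauss(65.0 + 2.0 * sex, 10.0)              # linear model: age given sex
    logit = -1.0 + 0.02 * (age - 65.0) + 0.3 * sex       # logistic model: stage given age, sex
    advanced = 1 if rng.random() < 1.0 / (1.0 + math.exp(-logit)) else 0
    return sex, age, advanced


def draw_survival_time(sex, age, advanced, rng, shape=1.2, scale=0.08):
    """Inverse-transform draw from a Weibull proportional hazards model.

    S(t | x) = exp(-scale * t**shape * exp(lp)); set S(t) = u and solve for t.
    (Assumption: Weibull is used here in place of a flexible parametric model.)
    """
    lp = 0.03 * (age - 65.0) + 0.8 * advanced - 0.1 * sex
    u = 1.0 - rng.random()                               # uniform on (0, 1]
    return (-math.log(u) / (scale * math.exp(lp))) ** (1.0 / shape)


def censor(time, entry, last_followup):
    """Administrative censoring at a common last follow-up date."""
    max_follow = last_followup - entry
    if time <= max_follow:
        return time, 1                                   # event observed
    return max_follow, 0                                 # censored at end of follow-up


def make_synthetic(n=1000, seed=0, last_followup=10.0):
    """Generate n synthetic records as a list of dicts."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        sex, age, advanced = draw_covariates(rng)
        time = draw_survival_time(sex, age, advanced, rng)
        entry = rng.uniform(0.0, 8.0)                    # hypothetical diagnosis time (years)
        observed, event = censor(time, entry, last_followup)
        rows.append({"sex": sex, "age": age, "advanced": advanced,
                     "time": observed, "event": event})
    return rows
```

Calling `make_synthetic(n=9064)` would produce a dataset of the same size as the colon cancer example. Because each record is drawn from fitted models rather than copied, no row is taken directly from the original data.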

RESULTS

We successfully create a synthetic version of an example colon cancer dataset of 9064 patients, which aims to show good similarity to both the covariate distributions and the survival times of the original data without containing any exact information from the original data, therefore allowing it to be published openly alongside research.

CONCLUSIONS

We evaluate the effectiveness of the methods for constructing synthetic data and provide evidence that there is minimal risk that a given patient from the original data could be identified from their unique individual information. Synthetic datasets built with this methodology could be released alongside published research without breaching data privacy protocols, allowing data and code to accompany methodological or applied manuscripts and greatly improving the transparency and accessibility of medical research.
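The abstract does not spell out the non-identifiability metrics, so the following is only one simple check of the kind that could be used, not the authors' metric: the share of synthetic records whose covariate pattern coincides with some original record after rounding continuous values. The function name `exact_match_rate`, the `keys` argument, and the rounding precision are all hypothetical.

```python
def exact_match_rate(original, synthetic, keys, digits=1):
    """Fraction of synthetic rows that match some original row on `keys`.

    Float values are rounded to `digits` decimals before comparison, so
    near-identical continuous values count as a match.
    """
    def signature(row):
        return tuple(
            round(row[k], digits) if isinstance(row[k], float) else row[k]
            for k in keys
        )

    seen = {signature(r) for r in original}
    hits = sum(1 for r in synthetic if signature(r) in seen)
    return hits / len(synthetic)


# Toy usage with made-up records (not real patient data).
orig_rows = [{"age": 61.24, "sex": 1}, {"age": 70.0, "sex": 0}]
synth_rows = [{"age": 61.21, "sex": 1}, {"age": 55.0, "sex": 0}]
rate = exact_match_rate(orig_rows, synth_rows, ["age", "sex"])
```

A low rate on the full covariate set suggests that few synthetic rows reproduce a real individual's pattern. On its own it does not rule out other disclosure routes.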
