Creating High-Quality Synthetic Health Data: Framework for Model Development and Validation.

Authors

Karimian Sichani Elnaz, Smith Aaron, El Emam Khaled, Mosquera Lucy

Affiliations

Department of Mathematics and Statistics, University of Ottawa, Ottawa, ON, Canada.

Children's Hospital of Eastern Ontario Research Institute, Ottawa, ON, Canada.

Publication information

JMIR Form Res. 2024 Apr 22;8:e53241. doi: 10.2196/53241.

DOI: 10.2196/53241
PMID: 38648097
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11034549/

Abstract

BACKGROUND

Electronic health records are a valuable source of patient information, but they must be properly deidentified before being shared with researchers, a process that requires expertise and time. Synthetic data considerably reduce the restrictions on the use and sharing of real data, allowing researchers to access data more rapidly and with far fewer privacy constraints. Consequently, there is growing interest in methods that generate synthetic data that protect patients' privacy while faithfully reflecting the real data.

OBJECTIVE

This study aims to develop and validate a model that generates valuable synthetic longitudinal health data while protecting the privacy of the patients whose data are collected.

METHODS

We investigated the best model for generating synthetic health data, with a focus on longitudinal observations. We developed a generative model that relies on the generalized canonical polyadic (GCP) tensor decomposition. This model also involves sampling from a latent factor matrix of GCP decomposition, which contains patient factors, using sequential decision trees, copula, and Hamiltonian Monte Carlo methods. We applied the proposed model to samples from the MIMIC-III (version 1.4) data set. Numerous analyses and experiments were conducted with different data structures and scenarios. We assessed the similarity between our synthetic data and the real data by conducting utility assessments. These assessments evaluate the structure and general patterns present in the data, such as dependency structure, descriptive statistics, and marginal distributions. Regarding privacy disclosure, our model preserves privacy by preventing the direct sharing of patient information and eliminating the one-to-one link between the observed and model tensor records. This was achieved by simulating and modeling a latent factor matrix of GCP decomposition associated with patients.
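
The pipeline described in the Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it fits an ordinary least-squares CP model by alternating least squares (the Gaussian-loss special case of GCP) and resamples the patient factor matrix with a Gaussian copula only; the sequential-tree and Hamiltonian Monte Carlo samplers are not shown. The tensor layout (patients x variables x visits), the toy data, and the function names `cp_als` and `sample_gaussian_copula` are hypothetical choices made for the example.

```python
import numpy as np
from statistics import NormalDist

def khatri_rao(B, C):
    """Column-wise Kronecker product; row (j, k) holds B[j, :] * C[k, :]."""
    return (B[:, None, :] * C[None, :, :]).reshape(-1, B.shape[1])

def cp_als(X, rank, n_iter=200, seed=0):
    """Least-squares CP decomposition of a 3-way tensor by alternating least squares."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.standard_normal((I, rank))  # patient factors (assumed mode 1)
    B = rng.standard_normal((J, rank))  # variable factors
    C = rng.standard_normal((K, rank))  # visit/time factors
    X1 = X.reshape(I, J * K)                     # mode-1 unfolding
    X2 = np.moveaxis(X, 1, 0).reshape(J, I * K)  # mode-2 unfolding
    X3 = np.moveaxis(X, 2, 0).reshape(K, I * J)  # mode-3 unfolding
    for _ in range(n_iter):
        A = X1 @ np.linalg.pinv(khatri_rao(B, C)).T
        B = X2 @ np.linalg.pinv(khatri_rao(A, C)).T
        C = X3 @ np.linalg.pinv(khatri_rao(A, B)).T
    return A, B, C

def sample_gaussian_copula(A, n_new, seed=0):
    """Draw n_new synthetic rows that mimic the marginals and rank dependence of A."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    nd = NormalDist()
    # Empirical ranks -> uniform scores in (0, 1) -> standard normal scores
    ranks = A.argsort(axis=0).argsort(axis=0) + 1
    Z = np.vectorize(nd.inv_cdf)(ranks / (n + 1))
    corr = np.atleast_2d(np.corrcoef(Z, rowvar=False))
    Z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_new)
    U_new = np.vectorize(nd.cdf)(Z_new)
    # Map uniforms back through each column's empirical quantile function
    return np.column_stack([np.quantile(A[:, r], U_new[:, r]) for r in range(d)])

# Toy rank-2 tensor standing in for real longitudinal records (assumed sizes).
rng = np.random.default_rng(42)
I, J, K, R = 60, 5, 4, 2
X = np.einsum("ir,jr,kr->ijk",
              rng.gamma(2.0, 1.0, (I, R)), rng.random((J, R)), rng.random((K, R)))

A, B, C = cp_als(X, rank=R)
X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)

# Only the small patient factor matrix is synthesized; B and C are reused as fitted.
A_syn = sample_gaussian_copula(A, n_new=100)
X_syn = np.einsum("ir,jr,kr->ijk", A_syn, B, C)
print(X_syn.shape)  # (100, 5, 4)
```

Because new rows of the patient factor matrix are sampled rather than copied, no synthetic patient maps one-to-one onto an observed record, and `n_new` is free to differ from the original number of patients.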

RESULTS

The findings show that our model is a promising method for generating synthetic longitudinal health data that is similar enough to real data. It can preserve the utility and privacy of the original data while also handling various data structures and scenarios. In certain experiments, all simulation methods used in the model produced the same high level of performance. Our model is also capable of addressing the challenge of sampling patients from electronic health records. This means that we can simulate a variety of patients in the synthetic data set, which may differ in number from the patients in the original data.
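
The similarity claim rests on utility assessments of descriptive statistics, marginal distributions, and dependency structure. A rough sketch of such a comparison is given below; the specific metrics (Hellinger distance on binned marginals, maximum absolute difference between correlation matrices) and the names `hellinger` and `utility_report` are assumptions chosen for illustration, since the abstract does not state which measures were used.

```python
import numpy as np

def hellinger(p_sample, q_sample, bins=20):
    """Hellinger distance between two samples, estimated on a shared histogram grid."""
    lo = min(p_sample.min(), q_sample.min())
    hi = max(p_sample.max(), q_sample.max())
    p, _ = np.histogram(p_sample, bins=bins, range=(lo, hi))
    q, _ = np.histogram(q_sample, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def utility_report(real, synth):
    """Compare two 2-D arrays (records x variables); smaller values mean closer agreement."""
    n_vars = real.shape[1]
    return {
        # Descriptive statistics
        "mean_abs_diff": np.abs(real.mean(axis=0) - synth.mean(axis=0)),
        "std_abs_diff": np.abs(real.std(axis=0) - synth.std(axis=0)),
        # Marginal distributions, one distance per variable
        "hellinger": np.array([hellinger(real[:, j], synth[:, j]) for j in range(n_vars)]),
        # Dependency structure
        "corr_max_abs_diff": float(np.abs(
            np.corrcoef(real, rowvar=False) - np.corrcoef(synth, rowvar=False)
        ).max()),
    }

# Toy data: two draws from the same bivariate normal stand in for real and synthetic records.
cov = [[1.0, 0.6], [0.6, 1.0]]
real = np.random.default_rng(0).multivariate_normal([0.0, 0.0], cov, size=500)
synth = np.random.default_rng(1).multivariate_normal([0.0, 0.0], cov, size=300)
report = utility_report(real, synth)
for name, value in report.items():
    print(name, value)
```

The two arrays need not have the same number of rows, which matches the setting where the synthetic cohort differs in size from the original one.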

CONCLUSIONS

We have presented a generative model for producing synthetic longitudinal health data. The model is formulated by applying the GCP tensor decomposition. We have provided 3 approaches for the synthesis and simulation of a latent factor matrix following the process of factorization. In brief, we have reduced the challenge of synthesizing massive longitudinal health data to synthesizing a nonlongitudinal and significantly smaller data set.
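
The dimensionality reduction claimed here can be made concrete with a short worked formulation. The notation below is an assumption for illustration (a 3-way tensor with patients on the first mode, rank $R$, and an elementwise loss $f$); the tensor sizes in the example are hypothetical and not taken from the study.

```latex
% Observed tensor and its rank-R low-rank model (assumed modes: patient i, variable j, time k)
\mathcal{X} \in \mathbb{R}^{I \times J \times K}, \qquad
m_{ijk} = \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr}

% GCP fits the factors by minimizing a sum of elementwise losses;
% f(x, m) = (x - m)^2 recovers the standard CP decomposition
\min_{A, B, C} \; \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} f\!\left(x_{ijk},\, m_{ijk}\right)

% Synthesis targets only the patient factor matrix A, one row per patient
A \in \mathbb{R}^{I \times R}
\quad\Longrightarrow\quad
I \cdot R \ \text{values to model instead of}\ I \cdot J \cdot K

% Hypothetical sizes: I = 1000, J = 50, K = 20, R = 5
I \cdot J \cdot K = 1\,000\,000, \qquad I \cdot R = 5\,000
```

Each row of $A$ is a fixed-length vector with no time index, which is why the synthesis problem becomes nonlongitudinal once the decomposition is fitted.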
