Karimian Sichani Elnaz, Smith Aaron, El Emam Khaled, Mosquera Lucy
Department of Mathematics and Statistics, University of Ottawa, Ottawa, ON, Canada.
Children's Hospital of Eastern Ontario Research Institute, Ottawa, ON, Canada.
JMIR Form Res. 2024 Apr 22;8:e53241. doi: 10.2196/53241.
Electronic health records are a valuable source of patient information, but they must be properly deidentified before being shared with researchers, a process that requires expertise and time. Synthetic data have considerably reduced the restrictions on the use and sharing of real data, allowing researchers to access data more rapidly and with far fewer privacy constraints. There is therefore growing interest in methods for generating synthetic data that protect patients' privacy while faithfully reflecting the real data.
This study aims to develop and validate a model that generates valuable synthetic longitudinal health data while protecting the privacy of the patients whose data are collected.
We investigated models for generating synthetic health data, with a focus on longitudinal observations. We developed a generative model that relies on the generalized canonical polyadic (GCP) tensor decomposition. The model samples from the latent factor matrix of the GCP decomposition that contains the patient factors, using 3 alternative simulation methods: sequential decision trees, copula, and Hamiltonian Monte Carlo. We applied the proposed model to samples from the MIMIC-III (version 1.4) data set and conducted numerous analyses and experiments with different data structures and scenarios. We assessed the similarity between the synthetic and real data through utility assessments, which evaluate the structure and general patterns present in the data, such as dependency structure, descriptive statistics, and marginal distributions. Regarding privacy disclosure, our model preserves privacy by preventing the direct sharing of patient information and by eliminating the one-to-one link between the observed and model tensor records. This was achieved by simulating and modeling the latent factor matrix of the GCP decomposition associated with patients.
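The pipeline described above (factorize, simulate the patient factor matrix, rebuild the tensor) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a three-way tensor (patients × features × time points), uses ordinary CP decomposition fitted by alternating least squares (the Gaussian-loss special case of GCP), and replaces the study's sequential decision trees, copula, and Hamiltonian Monte Carlo samplers with a plain multivariate normal fitted to the patient factors. The names `khatri_rao`, `cp_als`, and `synthesize` are hypothetical.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product; row index is j * C.shape[0] + k."""
    R = B.shape[1]
    return (B[:, None, :] * C[None, :, :]).reshape(-1, R)

def cp_als(X, rank, n_iter=200, seed=0):
    """Rank-`rank` CP decomposition of a 3-way tensor by alternating least squares.

    Assumption: squared-error loss, i.e. the Gaussian special case of GCP.
    Returns factor matrices A (patients), B (features), C (time points).
    """
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    # Mode-n unfoldings whose column ordering matches khatri_rao above.
    X1 = X.reshape(I, J * K)
    X2 = np.moveaxis(X, 1, 0).reshape(J, I * K)
    X3 = np.moveaxis(X, 2, 0).reshape(K, I * J)
    for _ in range(n_iter):
        A = X1 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = X2 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = X3 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

def synthesize(X, rank, n_new, seed=0):
    """Generate `n_new` synthetic patients from a real tensor X.

    Assumption: patient factors are simulated from a multivariate normal,
    a simple stand-in for the samplers named in the abstract.
    """
    rng = np.random.default_rng(seed)
    A, B, C = cp_als(X, rank, seed=seed)
    mean = A.mean(axis=0)
    cov = np.atleast_2d(np.cov(A, rowvar=False))
    A_new = rng.multivariate_normal(mean, cov, size=n_new)
    # Rebuild a tensor from simulated patient factors and the fitted B, C.
    return np.einsum("ir,jr,kr->ijk", A_new, B, C)
```

Because only the rows of the patient factor matrix are simulated, `n_new` does not have to equal the number of real patients, and no synthetic row corresponds one-to-one to an observed record.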
The findings show that our model is a promising method for generating synthetic longitudinal health data that are sufficiently similar to real data. It can preserve the utility and privacy of the original data while also handling various data structures and scenarios. In certain experiments, all simulation methods used in the model achieved the same high level of performance. Our model is also capable of addressing the challenge of sampling patients from electronic health records: the number of patients simulated in the synthetic data set can differ from the number of patients in the original data.
We have presented a generative model for producing synthetic longitudinal health data. The model is formulated by applying the GCP tensor decomposition. We have provided 3 approaches for the synthesis and simulation of a latent factor matrix following the process of factorization. In brief, we have reduced the challenge of synthesizing massive longitudinal health data to synthesizing a nonlongitudinal and significantly smaller data set.
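To make this reduction concrete, the formulation can be written out for a three-way tensor. The notation here is an assumption introduced for illustration and is not taken from the study: $x_{ijk}$ is the observed value for patient $i$, feature $j$, and time point $k$; $R$ is the decomposition rank; and $f$ is the elementwise loss that GCP allows to vary with the data type.

```latex
\min_{A,B,C}\ \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} f\bigl(x_{ijk},\, m_{ijk}\bigr),
\qquad
m_{ijk} = \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr}

% Synthesis: simulate a new patient row \tilde{a}_{i\cdot} \in \mathbb{R}^{R}
% and rebuild that patient's longitudinal record from the fitted B and C.
\tilde{m}_{ijk} = \sum_{r=1}^{R} \tilde{a}_{ir}\, b_{jr}\, c_{kr}
```

Under these assumptions, each patient is represented by $R$ numbers in the factor matrix $A \in \mathbb{R}^{I \times R}$ rather than by $J \times K$ longitudinal entries, so the synthesis step operates on a flat $I \times R$ table.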