Comparative assessment of synthetic time series generation approaches in healthcare: leveraging patient metadata for accurate data synthesis.

Affiliations

Digital Health and Biomedical Technologies, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastián, Spain.

Computer Science and Artificial Intelligence Department, Computer Science Faculty, University of the Basque Country (UPV/EHU), Donostia - San Sebastian, Spain.

Publication information

BMC Med Inform Decis Mak. 2024 Jan 30;24(1):27. doi: 10.1186/s12911-024-02427-0.

Abstract

BACKGROUND

Synthetic data is an emerging approach for addressing legal and regulatory concerns in biomedical research that deals with personal and clinical data, whether used on its own or in combination with other privacy-enhancing technologies. Generating uncompromised synthetic data could significantly benefit external researchers performing secondary analyses by providing unlimited access to information while complying with the pertinent regulations. However, the original data to be synthesized (e.g., data acquired in Living Labs) may consist of subjects' metadata (static) and a longitudinal component (a set of time-dependent measurements), which makes it challenging to produce coherent synthetic counterparts.

METHODS

Three approaches to synthetic time series generation were defined and compared in this work: (A1) generating only the metadata and coupling it with the real time series from the original data; (A2) generating metadata and time series separately and joining them afterwards; and (A3) generating metadata and time series jointly. The three approaches were compared using two synthetic data generation models: the Wasserstein GAN with Gradient Penalty (WGAN-GP) and DoppelGANger (DGAN). The experiments used three healthcare-related longitudinal datasets: (1) Treadmill Maximal Effort Test (TMET) measurements from the University of Malaga, (2) a hypotension subset derived from the MIMIC-III v1.4 database, and (3) a lifelogging dataset named PMData.
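To make the difference between A1, A2, and A3 concrete, the following is a minimal Python sketch of the three data flows. It is not the authors' implementation: the generator is a simple multivariate Gaussian standing in for WGAN-GP or DGAN, the toy cohort is invented for illustration, and the nearest-neighbour coupling used for A1 is an assumption, since the abstract does not say how synthetic metadata is matched to real time series. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" cohort (hypothetical): 2 static metadata columns and a
# 5-step time series per subject whose level depends on the first column.
n, T = 500, 5
real_meta = np.column_stack([rng.normal(50, 10, n), rng.normal(70, 8, n)])
real_ts = 0.5 * real_meta[:, [0]] + rng.normal(0, 1, (n, T))


def fit_sample(data, n_samples, rng):
    """Stand-in generator: fit a multivariate Gaussian and sample from it.

    NOT WGAN-GP or DGAN; it only marks where a trained model would go.
    """
    mean = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_samples)


def approach_a1(meta, ts, n_samples, rng):
    """A1: synthetic metadata coupled with real time series.

    Assumption: each synthetic record borrows the series of the real
    subject whose metadata is closest (Euclidean distance).
    """
    syn_meta = fit_sample(meta, n_samples, rng)
    dists = np.linalg.norm(syn_meta[:, None, :] - meta[None, :, :], axis=2)
    return syn_meta, ts[dists.argmin(axis=1)]


def approach_a2(meta, ts, n_samples, rng):
    """A2: metadata and time series generated separately, joined by row."""
    return fit_sample(meta, n_samples, rng), fit_sample(ts, n_samples, rng)


def approach_a3(meta, ts, n_samples, rng):
    """A3: one generator over the concatenated [metadata | time series]."""
    k = meta.shape[1]
    joint = fit_sample(np.hstack([meta, ts]), n_samples, rng)
    return joint[:, :k], joint[:, k:]


def meta_ts_corr(meta, ts):
    """Correlation between the first metadata column and the mean series level."""
    return np.corrcoef(meta[:, 0], ts.mean(axis=1))[0, 1]


results = {
    name: meta_ts_corr(*fn(real_meta, real_ts, n, rng))
    for name, fn in [("A1", approach_a1), ("A2", approach_a2), ("A3", approach_a3)]
}
for name, corr in results.items():
    print(f"{name}: metadata/time-series correlation = {corr:.2f}")
```

In this toy setting, A2 samples the two components independently, so any dependence between a subject's metadata and time series is lost by construction, whereas A1 and A3 keep it. This only illustrates the structural difference between the approaches; it says nothing about the resemblance, utility, or privacy results reported in the study.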

RESULTS

Three key dimensions of the generated synthetic data were assessed: (1) resemblance to the original data, (2) utility, and (3) privacy level. The best-performing approach varies with the dimension and metric assessed.

CONCLUSION

The initial characteristics of the datasets to be synthesized play a crucial role in determining the best approach. Coupling synthetic metadata with real time series (A1), as well as jointly generating synthetic time series and metadata (A3), are both competitive methods, while separately generating time series and metadata (A2) appears to perform more poorly overall.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3de9/10826010/ec05813557b7/12911_2024_2427_Fig1_HTML.jpg
