Using UMAP for Partially Synthetic Healthcare Tabular Data Generation and Validation.

Author information

Lázaro Carla, Angulo Cecilio

Affiliations

Intelligent Data Science and Artificial Intelligence Research Center, Technical University of Catalonia, Nexus II Building, Jordi Girona 29, 08034 Barcelona, Spain.

Robotics and Industrial Informatics Institute (CSIC-UPC), Llorens i Artigas 4, 08028 Barcelona, Spain.

Publication information

Sensors (Basel). 2024 Dec 8;24(23):7843. doi: 10.3390/s24237843.

Abstract

In healthcare, vast amounts of data are increasingly collected through sensors for smart health applications and patient monitoring or diagnosis. However, such medical data often comprise sensitive patient information, posing challenges regarding data privacy, and are resource-intensive to acquire for significant research purposes. In addition, the common case of lack of information due to technical issues, transcript errors, or differences between descriptors considered in different health centers leads to the need for data imputation and partial data generation techniques. This study introduces a novel methodology for partially synthetic tabular data generation, designed to reduce the reliance on sensor measurements and ensure secure data exchange. Using the UMAP (Uniform Manifold Approximation and Projection) visualization algorithm to transform the original, high-dimensional reference data set into a reduced-dimensional space, we generate and validate synthetic values for incomplete data sets. This approach mitigates the need for extensive sensor readings while addressing data privacy concerns by generating realistic synthetic samples. The proposed method is validated on prostate and breast cancer data sets, showing its effectiveness in completing and augmenting incomplete data sets using fully available references. Furthermore, our results demonstrate superior performance in comparison to state-of-the-art imputation techniques. This work makes a dual contribution by not only proposing an innovative method for synthetic data generation, but also studying and establishing a formal framework to understand and solve synthetic data generation and imputation problems in sensor-driven environments.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/75fb/11645063/27e10da57664/sensors-24-07843-g001.jpg