Sheng Ming, Wang Shuliang, Zhang Yong, Hao Rui, Liang Ye, Luo Yi, Yang Wenhan, Wang Jincheng, Li Yinan, Zheng Wenkui, Li Wenyao
School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081 China.
BNRist, DCST, RIIT, Tsinghua University, Beijing, 100084 China.
Health Inf Sci Syst. 2024 Jul 5;12(1):37. doi: 10.1007/s13755-024-00295-6. eCollection 2024 Dec.
Obtaining high-quality datasets from raw data is a key step before data exploration and analysis. In the medical domain, a large amount of data needs quality improvement before it can be used to analyze the health condition of patients. Data extraction, data cleaning and data imputation have each been studied extensively, but separately. Few frameworks integrate all three techniques, so the resulting datasets suffer in accuracy, consistency and integrity. In this paper, a multi-source heterogeneous data enhancement framework based on a lakehouse MHDP is proposed, which comprises three steps: data extraction, data cleaning and data imputation. In the data extraction step, a data fusion technique is offered to handle multi-modal and multi-source heterogeneous data. In the data cleaning step, we propose HoloCleanX, which provides a convenient interactive procedure. In the data imputation step, multiple imputation (MI) and the state-of-the-art algorithm SAITS are applied in different situations. We evaluate our framework on three tasks: clustering, classification and strategy prediction. The experimental results demonstrate the effectiveness of our data enhancement framework.
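The three-step flow described above (extraction, cleaning, imputation) can be illustrated with a minimal sketch. This is not the authors' implementation: the record layout, the `VALID_RANGES` thresholds and the functions `extract`, `clean` and `impute` are hypothetical names introduced here, the cleaning rule is a simple range check rather than HoloCleanX, and column-mean filling stands in for MI and SAITS only to keep the example self-contained.

```python
from statistics import mean

# Hypothetical records from two sources, keyed by patient id.
lab_source = {
    "p1": {"heart_rate": 72},
    "p2": {"heart_rate": 400},   # implausible value, removed by cleaning
    "p3": {"heart_rate": None},  # missing value
}
ward_source = {
    "p1": {"temp_c": 36.8},
    "p2": {"temp_c": 37.1},
    "p3": {"temp_c": 38.0},
}

# Assumed plausibility ranges used by the toy cleaning rule.
VALID_RANGES = {"heart_rate": (20, 250), "temp_c": (30.0, 45.0)}


def extract(*sources):
    """Fuse per-source records into one row per patient id."""
    fused = {}
    for source in sources:
        for pid, fields in source.items():
            fused.setdefault(pid, {}).update(fields)
    return fused


def clean(rows):
    """Replace values outside the assumed valid range with None."""
    cleaned = {}
    for pid, fields in rows.items():
        cleaned[pid] = {}
        for name, value in fields.items():
            low, high = VALID_RANGES[name]
            ok = value is not None and low <= value <= high
            cleaned[pid][name] = value if ok else None
    return cleaned


def impute(rows):
    """Fill None with the column mean (a placeholder, not MI or SAITS)."""
    columns = {name for fields in rows.values() for name in fields}
    filled = {pid: dict(fields) for pid, fields in rows.items()}
    for name in columns:
        observed = [f[name] for f in rows.values() if f.get(name) is not None]
        if not observed:
            continue  # nothing to estimate from; leave the gaps as they are
        fill = mean(observed)
        for fields in filled.values():
            if fields.get(name) is None:
                fields[name] = fill
    return filled


enhanced = impute(clean(extract(lab_source, ward_source)))
```

Keeping each step as a function from rows to rows means a stage can be swapped (for example, replacing the mean fill with a model-based imputer) without touching the others.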