Center of Health Data Science, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117 Berlin, Germany.
Digital Health Center, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117 Berlin, Germany.
Stud Health Technol Inform. 2023 May 18;302:691-695. doi: 10.3233/SHTI230241.
Making health data available for secondary use enables innovative, data-driven medical research. Because modern machine learning (ML) methods and precision medicine require large amounts of data covering most standard and edge cases, acquiring large datasets is an essential first step. This can typically be achieved only by integrating datasets from various sources and sharing data across sites. Obtaining a unified dataset from heterogeneous sources requires standard representations and common data models (CDMs). Mapping data into these standardized representations is usually tedious and requires many manual configuration and refinement steps. One potential way to reduce this effort is to use ML methods not only for data analysis but also for integrating health data at the syntactic, structural, and semantic levels. However, research on ML-based medical data integration is still in its infancy. In this article, we describe the current state of the literature and present selected methods that appear to have particularly high potential to improve medical data integration. We also discuss open issues and possible future research directions.