Finster Melissa, Moinat Maxim, Taghizadeh Elham
Fraunhofer Institute for Digital Medicine MEVIS, Bremen, Bremen, Germany.
Erasmus University Medical Center, Rotterdam, South Holland, Netherlands.
PLoS One. 2025 Jan 6;20(1):e0311511. doi: 10.1371/journal.pone.0311511. eCollection 2025.
The German Health Data Lab is going to provide access to German statutory health insurance claims data from 2009 to the present for research purposes. Because data formats within the German Health Data Lab have evolved over time, these data need to be standardized into a Common Data Model to facilitate collaborative health research and to spare researchers from adapting to multiple data formats. For this purpose, we chose to transform the data into the Observational Medical Outcomes Partnership (OMOP) Common Data Model.
We developed an Extract, Transform, and Load (ETL) pipeline for two distinct German Health Data Lab data formats: Format 1 (2009-2016) and Format 3 (2019 onwards). Because Format 1 and Format 2 (2017-2018) share an identical structure, the ETL pipeline for Format 1 can be applied to Format 2 as well. Our ETL process, supported by Observational Health Data Sciences and Informatics (OHDSI) tools, includes specification development, SQL skeleton creation, and concept mapping. We detail the process characteristics and present a quality assessment that includes field coverage and concept mapping accuracy using example data.
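The concept mapping step named above can be illustrated with a minimal sketch. It assumes a simple in-memory lookup from source diagnosis codes to standard concept IDs and follows the OMOP convention of assigning concept ID 0 to unmapped source values. The lookup table, the placeholder IDs, and the function name are hypothetical and are not taken from the published pipeline, which is SQL-based.

```python
from dataclasses import dataclass

# Hypothetical lookup table. The integer IDs are placeholders for
# illustration only, not real OMOP concept IDs.
SOURCE_TO_CONCEPT = {"E11.9": 1001, "I10": 1002}

# OMOP convention: source values without a standard concept get concept ID 0.
UNMAPPED_CONCEPT_ID = 0


@dataclass
class ConditionOccurrence:
    person_id: int
    condition_source_value: str  # original code, kept verbatim
    condition_concept_id: int    # mapped standard concept, or 0


def transform_diagnosis(person_id: int, source_code: str) -> ConditionOccurrence:
    """Map one source diagnosis code to a condition record."""
    normalized = source_code.strip().upper()
    concept_id = SOURCE_TO_CONCEPT.get(normalized, UNMAPPED_CONCEPT_ID)
    return ConditionOccurrence(person_id, source_code, concept_id)


rows = [transform_diagnosis(1, "E11.9"), transform_diagnosis(2, "XYZ")]
```

Keeping the original code in `condition_source_value` means that invalid or unmapped codes remain traceable after loading, which is what makes a later mapping coverage assessment possible.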
For Format 1, we achieved a field coverage of 92.7%. The Data Quality Dashboard showed 100.0% conformance and 80.6% completeness, although plausibility checks were disabled. The mapping coverage for the Condition domain was low at 18.3% due to invalid codes and missing mappings in the provided example data. For Format 3, the field coverage was 86.2%, with the Data Quality Dashboard reporting 99.3% conformance and 75.9% completeness. The Procedure domain had very low mapping coverage (2.2%) due to the use of mocked data and unmapped local concepts. In the Condition domain, 99.8% of unique codes were mapped. The absence of real data limits a comprehensive assessment of quality.
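As an illustration of how a mapping coverage figure over unique codes might be computed, the sketch below assumes that a code counts as mapped when at least one of its records carries a non-zero concept ID. The abstract does not state the exact definition the authors used, so the function and the sample data are hypothetical.

```python
def unique_code_coverage(records):
    """Return the percentage of unique source codes with a non-zero concept ID.

    records: iterable of (source_code, concept_id) pairs.
    """
    is_mapped = {}
    for code, concept_id in records:
        # A code counts as mapped if any of its records has a concept ID != 0.
        is_mapped[code] = is_mapped.get(code, False) or concept_id != 0
    if not is_mapped:
        return 0.0
    return 100.0 * sum(is_mapped.values()) / len(is_mapped)


# Hypothetical sample: four unique codes, two of them mapped.
sample = [("E11.9", 1001), ("E11.9", 1001), ("I10", 1002), ("XYZ", 0), ("ABC", 0)]
coverage = unique_code_coverage(sample)
```

Counting unique codes rather than records keeps a single frequent code from dominating the figure. A record-level coverage would answer a different question, namely how much of the data volume is usable after mapping.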
The ETL process effectively transforms the data with high field coverage and conformance. It simplifies data utilization for German Health Data Lab users and enhances the use of OHDSI analysis tools. This initiative represents a significant step towards facilitating cross-border research in Europe by providing publicly available, standardized ETL processes (https://github.com/FraunhoferMEVIS/ETLfromHDLtoOMOP) and evaluations of their performance.