Talvik Harry-Anton, Oja Marek, Tamm Sirli, Mooses Kerli, Särg Dage, Lõo Marcus, Renata Siimon Õie, Šuvalov Hendrik, Kolde Raivo, Vilo Jaak, Reisberg Sulev, Laur Sven
Institute of Computer Science, University of Tartu, 51009 Tartu, Estonia; STACC, 51009 Tartu, Estonia.
Institute of Computer Science, University of Tartu, 51009 Tartu, Estonia.
J Biomed Inform. 2025 Jan;161:104765. doi: 10.1016/j.jbi.2024.104765. Epub 2024 Dec 26.
This study aims to address the gap in the literature on converting real-world Clinical Document Architecture (CDA) data into the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), focusing on the initial steps preceding the mapping phase. We highlight the importance of a repeatable Extract-Transform-Load (ETL) pipeline for health data extraction from HL7 CDA documents in Estonia for research purposes.
We developed a repeatable ETL pipeline that extracts, cleans, and restructures health data from CDA documents for loading into OMOP CDM, ensuring a high-quality, structured data format. The pipeline was designed to adapt to ongoing changes in the data exchange format and to handle different CDA document subsets for different scientific studies.
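As an illustration of the extraction and restructuring step only, the following minimal sketch pulls coded numeric observations out of a CDA document and reshapes them into rows that use OMOP CDM `measurement` field names. It is not the authors' pipeline: the sample document, the function name `extract_measurements`, and the choice to keep only observations carrying a numeric value are assumptions made for this example, and vocabulary mapping to standard concepts is deliberately left out.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# HL7 v3 namespace used by CDA documents.
NS = {"hl7": "urn:hl7-org:v3"}

# Hypothetical, heavily simplified CDA fragment (not real patient data).
SAMPLE_CDA = """<ClinicalDocument xmlns="urn:hl7-org:v3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <component><structuredBody><component><section>
    <entry>
      <observation classCode="OBS" moodCode="EVN">
        <code code="29463-7" codeSystem="2.16.840.1.113883.6.1"
              displayName="Body weight"/>
        <effectiveTime value="20200115"/>
        <value xsi:type="PQ" value="72.5" unit="kg"/>
      </observation>
    </entry>
  </section></component></structuredBody></component>
</ClinicalDocument>"""


def extract_measurements(cda_xml):
    """Return one dict per observation that has a code and a numeric value.

    Keys follow OMOP CDM `measurement` column names; source codes are kept
    as-is because concept mapping is a later step.
    """
    root = ET.fromstring(cda_xml)
    rows = []
    for obs in root.iterfind(".//hl7:observation", NS):
        code = obs.find("hl7:code", NS)
        value = obs.find("hl7:value", NS)
        when = obs.find("hl7:effectiveTime", NS)
        if code is None or value is None or value.get("value") is None:
            continue  # skip observations without a structured value
        try:
            number = float(value.get("value"))
        except ValueError:
            continue  # cleaning step: drop non-numeric values
        date = None
        if when is not None and when.get("value"):
            # CDA timestamps start with YYYYMMDD; ignore any time part.
            date = datetime.strptime(when.get("value")[:8], "%Y%m%d").date()
        rows.append({
            "measurement_source_value": code.get("code"),
            "value_as_number": number,
            "unit_source_value": value.get("unit"),
            "measurement_date": date,
        })
    return rows


rows = extract_measurements(SAMPLE_CDA)
```

Keeping the source code and unit untouched at this stage reflects the abstract's focus on the steps that precede mapping; a real pipeline would also need to handle free-text sections and format changes over time, which this sketch does not attempt.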
Using selected use cases, we demonstrated that our pipeline successfully transformed a significant portion of diagnosis codes, body weight and eGFR measurements, and PAP test results from CDA documents into OMOP CDM, showing that structured data can be extracted readily. However, we encountered challenges such as harmonising diverse coding systems and extracting lab results from free-text sections. Developing the pipeline iteratively allowed errors to be detected and corrected quickly, which made the process more efficient.
After a decade of focused work, our research has produced an ETL pipeline that effectively transforms HL7 CDA documents into OMOP CDM in Estonia, addressing key data extraction and transformation challenges. The pipeline's repeatability and adaptability to various data subsets make it a valuable resource for researchers working with health data. Although it was tested on Estonian data, the principles outlined are broadly applicable and may help in handling health data standards that vary by country. Although newer health data standards are emerging, CDA remains relevant for retrospective health studies, so this work retains its importance.