Centre for Big Data Research in Health, UNSW, Sydney, Australia.
Concord Clinical School, University of Sydney, Sydney, Australia.
PLoS One. 2022 Apr 11;17(4):e0266911. doi: 10.1371/journal.pone.0266911. eCollection 2022.
Common data models standardize the structure and semantics of health datasets, enabling reproducibility and large-scale studies that draw on data from multiple locations and settings. The Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) is one of the leading common data models. While there is a strong incentive to convert datasets to OMOP, the conversion is time- and resource-intensive, leaving the research community in need of tools for mapping data to OMOP. We propose an extract, transform, load (ETL) framework that is metadata-driven and generic across source datasets. The ETL framework uses a new data manipulation language (DML) that organizes SQL snippets in YAML. Our framework includes a compiler that converts YAML files containing mapping logic into an ETL script. The ETL framework is available via a web application, which allows users to upload and edit YAML files in a web editor and obtain an ETL SQL script for use in development environments. The structure of the DML is designed to improve readability, ease of refactoring, and maintainability, while reducing technical debt and standardizing the writing of ETL operations for mapping to OMOP. Our framework also supports transparency of the mapping process and reuse by different institutions.
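The idea of compiling declarative mapping metadata into an ETL script can be illustrated with a minimal sketch. The abstract does not show the DML syntax, so everything below is an assumption made for illustration: the keys `target`, `source`, `columns`, and `where`, the source table `src_patients` and its columns, and the function `compile_mapping` are all hypothetical. The dictionary stands in for what a YAML mapping file might look like once parsed, and the mapping covers only a few columns of the OMOP `person` table.

```python
# Hypothetical mapping spec, standing in for a parsed YAML file.
# The keys and the source table/columns are illustrative assumptions,
# not the DML syntax defined by the framework.
MAPPING = {
    "target": "person",
    "source": "src_patients",
    "columns": {
        # OMOP column name -> SQL snippet that computes it
        "person_id": "patient_id",
        "gender_concept_id": (
            "CASE sex WHEN 'M' THEN 8507 WHEN 'F' THEN 8532 ELSE 0 END"
        ),
        "year_of_birth": "EXTRACT(YEAR FROM date_of_birth)",
    },
    "where": "patient_id IS NOT NULL",
}


def compile_mapping(spec: dict) -> str:
    """Render one mapping spec as a single INSERT ... SELECT statement."""
    columns = spec["columns"]
    if not columns:
        raise ValueError("mapping must define at least one column")

    # Target column list and the matching SELECT expressions, in the same order.
    target_cols = ",\n    ".join(columns)
    select_exprs = ",\n    ".join(
        f"{expr} AS {name}" for name, expr in columns.items()
    )

    sql = (
        f"INSERT INTO {spec['target']} (\n    {target_cols}\n)\n"
        f"SELECT\n    {select_exprs}\n"
        f"FROM {spec['source']}"
    )
    # The filter is optional; append it only when the spec provides one.
    if spec.get("where"):
        sql += f"\nWHERE {spec['where']}"
    return sql + ";"


if __name__ == "__main__":
    print(compile_mapping(MAPPING))
```

Keeping each SQL snippet next to the target column it populates is one way such a layout could make the mapping logic easier to review and reuse; the actual framework may organize its YAML differently.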