A Secure and Reusable Software Architecture for Supporting Online Data Harmonization.

Authors

Feric Zlatan, Bohm Agostini Nicolas, Beene Daniel, Signes-Pastor Antonio J, Halchenko Yuliya, Watkins Deborah, MacKenzie Debra, Karagas Margaret, Manjourides Justin, Alshawabkeh Akram, Kaeli David

Affiliations

Dept. of Electrical and Computer Engineering, Northeastern University.

Community Environmental Health Program, College of Pharmacy, Health Sciences Center, University of New Mexico.

Publication information

Proc IEEE Int Conf Big Data. 2021 Dec;2021:2801-2812. doi: 10.1109/bigdata52589.2021.9671538.

DOI:10.1109/bigdata52589.2021.9671538

PMID:35449545

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9020435/

Abstract

Retrospective data harmonization across multiple research cohorts and studies is frequently done to increase statistical power, enable comparative analysis, and create a richer data source for data mining. However, when combining disparate data sources, harmonization projects face data management and analysis challenges. These include differences in the data dictionaries and variable definitions, privacy concerns surrounding health data representing sensitive populations, and a lack of properly defined data models. With the availability of mature open-source web-based database technologies, developing a complete software architecture to overcome the challenges associated with the harmonization process can alleviate many roadblocks. By leveraging state-of-the-art software engineering and database principles, we can ensure data quality and enable cross-center online access and collaboration. This paper outlines a complete software architecture developed and customized using the Django web framework, leveraged to harmonize sensitive data collected from three NIH-supported birth cohorts. We describe our framework and show how we successfully overcame challenges faced when harmonizing data from these cohorts. We discuss our efforts in data cleaning, data sharing, data transformation, data visualization, and analytics, while reflecting on what we have learned to date from these harmonized datasets.

Similar articles

A review of harmonization methods for studying dietary patterns.

Smart Health (Amst). 2022 Mar;23. doi: 10.1016/j.smhl.2021.100263. Epub 2022 Jan 13.

Data harmonization and federated analysis of population-based studies: the BioSHaRE project.

Emerg Themes Epidemiol. 2013 Nov 21;10(1):12. doi: 10.1186/1742-7622-10-12.

The project data sphere initiative: accelerating cancer research by sharing data.

Oncologist. 2015 May;20(5):464-e20. doi: 10.1634/theoncologist.2014-0431. Epub 2015 Apr 15.

Data Integration for Future Medicine (DIFUTURE).

Methods Inf Med. 2018 Jul;57(S 01):e57-e65. doi: 10.3414/ME17-02-0022. Epub 2018 Jul 17.

Initiatives, Concepts, and Implementation Practices of FAIR (Findable, Accessible, Interoperable, and Reusable) Data Principles in Health Data Stewardship Practice: Protocol for a Scoping Review.

JMIR Res Protoc. 2021 Feb 2;10(2):e22505. doi: 10.2196/22505.

Harmonizing data on correlates of sleep in children within and across neurodevelopmental disorders: lessons learned from an Ontario Brain Institute cross-program collaboration.

Front Neuroinform. 2024 May 17;18:1385526. doi: 10.3389/fninf.2024.1385526. eCollection 2024.

The National COVID Cohort Collaborative (N3C): Rationale, design, infrastructure, and deployment.

J Am Med Inform Assoc. 2021 Mar 1;28(3):427-443. doi: 10.1093/jamia/ocaa196.

Facilitating Harmonization of Variables in Framingham, MESA, ARIC, and REGARDS Studies Through a Metadata Repository.

Circ Cardiovasc Qual Outcomes. 2023 Nov;16(11):e009938. doi: 10.1161/CIRCOUTCOMES.123.009938. Epub 2023 Oct 18.

Development of an open-source, flexible framework for complex inter-institutional disparate data sharing and collaboration.

AMIA Jt Summits Transl Sci Proc. 2013 Mar 18;2013:103. eCollection 2013.

References cited by this article

Urinary specific gravity measures in the U.S. population: Implications for the adjustment of non-persistent chemical urinary biomarker data.

Environ Int. 2021 Nov;156:106656. doi: 10.1016/j.envint.2021.106656. Epub 2021 May 29.

Machado: Open source genomics data integration framework.

Gigascience. 2020 Sep 14;9(9). doi: 10.1093/gigascience/giaa097.

Exposure to uranium and co-occurring metals among pregnant Navajo women.

Environ Res. 2020 Nov;190:109943. doi: 10.1016/j.envres.2020.109943. Epub 2020 Jul 17.

The challenges in data integration - heterogeneity and complexity in clinical trials and patient registries of Systemic Lupus Erythematosus.

BMC Med Res Methodol. 2020 Jun 24;20(1):164. doi: 10.1186/s12874-020-01057-0.

Sharing SRP data to reduce environmentally associated disease and promote transdisciplinary research.

Rev Environ Health. 2020 Jun 25;35(2):111-122. doi: 10.1515/reveh-2019-0089.

Prenatal exposure to metal mixture and sex-specific birth outcomes in the New Hampshire Birth Cohort Study.

Environ Epidemiol. 2019 Oct;3(5). doi: 10.1097/EE9.0000000000000068.

Environmental phthalate exposure and preterm birth in the PROTECT birth cohort.

Environ Int. 2019 Nov;132:105099. doi: 10.1016/j.envint.2019.105099. Epub 2019 Aug 17.

A Review of Metal Exposure Studies Conducted in the Rural Southwestern and Mountain West Region of the United States.

Curr Epidemiol Rep. 2019 Mar;6(1):34-49. doi: 10.1007/s40471-019-0182-3. Epub 2019 Feb 12.

Development and use of a flexible data harmonization platform to facilitate the harmonization of individual patient data for meta-analyses.

BMC Res Notes. 2019 Mar 22;12(1):164. doi: 10.1186/s13104-019-4210-7.

A visual interactive analytic tool for filtering and summarizing large health data sets coded with hierarchical terminologies (VIADS).

BMC Med Inform Decis Mak. 2019 Feb 14;19(1):31. doi: 10.1186/s12911-019-0750-y.
