Medical Centre for Information and Communication Technology, Universitätsklinikum Erlangen, Erlangen, Germany.
Chair of Medical Informatics, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany.
Appl Clin Inform. 2019 Aug;10(4):679-692. doi: 10.1055/s-0039-1695793. Epub 2019 Sep 11.
High-quality clinical data and biological specimens are key for medical research and personalized medicine. The Biobanking and Biomolecular Resources Research Infrastructure-European Research Infrastructure Consortium (BBMRI-ERIC) aims to facilitate access to such biological resources. The accompanying ADOPT BBMRI-ERIC project kick-started BBMRI-ERIC by collecting colorectal cancer data from European biobanks.
To transform these data into a common representation, a uniform approach for data integration and harmonization had to be developed. This article describes the design and the implementation of a toolset for this task.
Based on the semantics of a metadata repository, we developed a lexical bag-of-words matcher, capable of semiautomatically mapping local biobank terms to the central ADOPT BBMRI-ERIC terminology. Its algorithm supports fuzzy matching, utilization of synonyms, and sentiment tagging. To process the anonymized instance data based on these mappings, we also developed a data transformation application.
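The matching idea described above can be illustrated with a minimal sketch. It is not the published tool's algorithm: the synonym table, the thresholds, the scoring formula, and all function names are hypothetical, and sentiment tagging is left out. It only shows how a bag-of-words comparison with synonym normalization and fuzzy token matching might propose a central term for a local one, leaving low-scoring terms to a human expert.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table (assumption): maps variants to one canonical token.
SYNONYMS = {"tumour": "tumor", "neoplasm": "tumor", "gender": "sex"}


def tokens(term):
    """Lower-case a term, split on non-alphanumerics, and apply synonyms."""
    cleaned = "".join(c if c.isalnum() else " " for c in term.lower())
    return {SYNONYMS.get(word, word) for word in cleaned.split()}


def token_similarity(a, b):
    """Fuzzy similarity of two tokens in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def bag_score(local, central, fuzzy_threshold=0.8):
    """Share of tokens on both sides that have a fuzzy partner on the other side."""
    a, b = tokens(local), tokens(central)
    if not a or not b:
        return 0.0

    def covered(src, dst):
        # Count tokens in src whose best match in dst reaches the threshold.
        return sum(
            1 for s in src
            if max(token_similarity(s, d) for d in dst) >= fuzzy_threshold
        )

    return (covered(a, b) + covered(b, a)) / (len(a) + len(b))


def suggest_mapping(local, central_terms, min_score=0.6):
    """Return (best central term or None, score); None means 'ask an expert'."""
    if not central_terms:
        return None, 0.0
    best = max(central_terms, key=lambda c: bag_score(local, c))
    score = bag_score(local, best)
    return (best, score) if score >= min_score else (None, score)


if __name__ == "__main__":
    central = ["Sex", "Tumor grade", "Age at diagnosis", "Histology"]
    for local in ["Tumour Grade", "Histolgy", "Blood pressure"]:
        print(local, "->", suggest_mapping(local, central))
```

Returning `None` below the threshold mirrors the semiautomatic workflow: the matcher proposes mappings where it is reasonably sure, and the remaining terms go to manual curation.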
The implementation was used to process the data from 10 European biobanks. The lexical matcher automatically and correctly mapped 78.48% of the 1,492 local biobank terms, and human experts were able to complete the remaining mappings. We used the expert-curated mappings to successfully process 147,608 data records from 3,415 patients.
We created a generic harmonization approach and used it successfully to harmonize data across institutions at 10 European biobanks. The software tools were made available as open source.