Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States.
Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States.
Methods Inf Med. 2022 May;61(1-02):3-10. doi: 10.1055/s-0041-1739361. Epub 2021 Nov 24.
Data harmonization is essential to integrate individual participant data from multiple sites, time periods, and trials for meta-analysis. The process of mapping terms and phrases to an ontology is complicated by typographic errors, abbreviations, truncation, and plurality. We sought to harmonize medical history (MH) and adverse events (AE) term records across 21 randomized clinical trials in pulmonary arterial hypertension and chronic thromboembolic pulmonary hypertension.
We developed and applied a semi-automated harmonization pipeline for use with domain-expert annotators to resolve ambiguous term mappings using exact and fuzzy matching. We summarized MH and AE term mapping success, including map quality measures, and imputation of a generalizing term hierarchy as defined by the applied Medical Dictionary for Regulatory Activities (MedDRA) ontology standard.
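The exact-then-fuzzy matching step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the normalization rule, the similarity cutoff, and the two-term dictionary are all assumptions chosen for illustration, and Python's standard-library `difflib` stands in for whatever fuzzy matcher the pipeline actually used.

```python
import difflib


def normalize(term: str) -> str:
    """Lowercase, trim, and collapse internal whitespace (assumed rule)."""
    return " ".join(term.lower().split())


def map_term(raw, dictionary, n=3, cutoff=0.8):
    """Map a raw term to dictionary terms.

    Returns ("exact", [term]) when the normalized strings are identical,
    ("fuzzy", candidates) when only close matches exist (these would be
    shown to an annotator as recommendations), or ("unmapped", []).
    """
    lookup = {normalize(t): t for t in dictionary}
    key = normalize(raw)
    if key in lookup:
        return "exact", [lookup[key]]
    # difflib ranks candidates by SequenceMatcher ratio, best first.
    close = difflib.get_close_matches(key, list(lookup), n=n, cutoff=cutoff)
    if close:
        return "fuzzy", [lookup[c] for c in close]
    return "unmapped", []


# Hypothetical two-term dictionary, for illustration only.
terms = ["Headache", "Nausea"]
print(map_term("  HEADACHE ", terms))
print(map_term("headachee", terms))
print(map_term("xyz", terms))
```

In a setup like this, exact matches are accepted automatically, while fuzzy candidates are only suggestions that a domain expert confirms or rejects.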
Over 99.6% of both MH (n = 37,105) and AE (n = 58,170) records were successfully mapped to MedDRA low-level terms. Automated exact matching accounted for 74.9% of MH and 85.5% of AE mappings. Term recommendations from fuzzy matching in the pipeline facilitated annotator mapping of the remaining 24.9% of MH and 13.8% of AE records. Imputation of the generalized MedDRA term hierarchy was unambiguous in 85.2% of high-level terms, 99.4% of high-level group terms, and 99.5% of system organ classes in MH, and in 75% of high-level terms, 98.3% of high-level group terms, and 98.4% of system organ classes in AE.
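One way to read "unambiguous" imputation is sketched below, assuming that a mapped low-level term may be linked to more than one path through the hierarchy and that a level counts as unambiguous when every path agrees on it. The data structure, the function name, and the placeholder labels (`LLT-1`, `HLT-A`, and so on) are hypothetical and are not real MedDRA content; the abstract does not state how the authors defined or computed ambiguity.

```python
LEVELS = ("hlt", "hlgt", "soc")


def impute_hierarchy(llt, paths_by_llt):
    """For each level, return the single agreed value or None if ambiguous.

    paths_by_llt maps a low-level term to a list of dicts, one per
    hierarchy path, each with keys "hlt", "hlgt", and "soc".
    """
    paths = paths_by_llt.get(llt, [])
    result = {}
    for level in LEVELS:
        values = {p[level] for p in paths}
        # Exactly one distinct value means the level is unambiguous.
        result[level] = values.pop() if len(values) == 1 else None
    return result


# Hypothetical paths: LLT-2 has two candidate high-level terms that
# share the same high-level group term and system organ class.
paths_by_llt = {
    "LLT-1": [{"hlt": "HLT-A", "hlgt": "HLGT-X", "soc": "SOC-1"}],
    "LLT-2": [
        {"hlt": "HLT-B", "hlgt": "HLGT-Y", "soc": "SOC-2"},
        {"hlt": "HLT-C", "hlgt": "HLGT-Y", "soc": "SOC-2"},
    ],
}

print(impute_hierarchy("LLT-1", paths_by_llt))
print(impute_hierarchy("LLT-2", paths_by_llt))
```

Under this reading, a term can be ambiguous at the high-level term while still resolving to a single high-level group term and system organ class, which is consistent with the lower unambiguous rate reported for high-level terms.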
This pipeline dramatically reduced the burden of manual annotation for MH and AE term harmonization and could be adapted to other data integration efforts.