Deep-learning-based automated terminology mapping in OMOP-CDM.

Author information

Kang Byungkon, Yoon Jisang, Kim Ha Young, Jo Sung Jin, Lee Yourim, Kam Hye Jin

Affiliations

Department of Computer Science, State University of New York, Incheon, South Korea.

Graduate School of Information, Yonsei University, Seoul, South Korea.

Publication information

J Am Med Inform Assoc. 2021 Jul 14;28(7):1489-1496. doi: 10.1093/jamia/ocab030.

DOI:10.1093/jamia/ocab030

PMID:33987667

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8279781/

Abstract

OBJECTIVE

Accessing medical data from multiple institutions is difficult owing to the interinstitutional diversity of vocabularies. Standardization schemes, such as the common data model, have been proposed as solutions to this problem, but such schemes require expensive human supervision. This study aims to construct a trainable system that can automate the process of semantic interinstitutional code mapping.

MATERIALS AND METHODS

To automate mapping between source and target codes, we compute the embedding-based semantic similarity between corresponding descriptive sentences. We also implement a systematic approach for preparing training data for similarity computation. Experimental results are compared to traditional word-based mappings.
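As a rough illustration of ranking target codes by the similarity of their descriptions, the sketch below scores each candidate with cosine similarity and keeps the top k. It is not the authors' model: the `embed` function is a toy character-trigram stand-in for a learned sentence embedding, and the codes `T1` to `T3` and their descriptions are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a learned sentence embedding (assumption):
    a sparse vector of character-trigram counts."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(source_desc: str, targets: dict, k: int = 3) -> list:
    """Return the k target codes whose descriptions are most similar
    to the source description, best match first."""
    src_vec = embed(source_desc)
    scored = [(code, cosine(src_vec, embed(desc))) for code, desc in targets.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical target vocabulary: code -> descriptive sentence.
targets = {
    "T1": "Type 2 diabetes mellitus",
    "T2": "Essential hypertension",
    "T3": "Acute myocardial infarction",
}

ranked = top_k("diabetes mellitus type II", targets, k=2)
for code, score in ranked:
    print(code, round(score, 3))
```

In a trained system, `embed` would be replaced by the learned encoder; the ranking logic around it would stay the same.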

RESULTS

The proposed model is compared against Usagi, the state-of-the-art automated matching system for the Observational Medical Outcomes Partnership (OMOP) common data model. By incorporating multiple negative training samples per positive sample, our semantic matching method significantly outperforms Usagi. Its matching accuracy is at least 10% greater than that of Usagi, and this advantage is consistent across various top-k measurements.
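For readers unfamiliar with top-k measurements, the following sketch shows one common way to compute top-k accuracy: a source code counts as correctly matched if its gold target appears among the first k ranked candidates. The function name, the source codes `S1` to `S3`, and the rankings are hypothetical and are not taken from the paper's evaluation.

```python
def top_k_accuracy(rankings: dict, gold: dict, k: int) -> float:
    """Fraction of source codes whose gold target code appears
    among the first k candidates of its ranked list."""
    hits = sum(1 for source, ranked in rankings.items() if gold[source] in ranked[:k])
    return hits / len(rankings)


# Hypothetical ranked candidates per source code (best first).
rankings = {
    "S1": ["T1", "T4", "T2"],
    "S2": ["T3", "T2", "T5"],
    "S3": ["T6", "T7", "T8"],
}
# Hypothetical gold mappings; S3's gold target is never retrieved.
gold = {"S1": "T1", "S2": "T2", "S3": "T9"}

acc_at_1 = top_k_accuracy(rankings, gold, 1)
acc_at_3 = top_k_accuracy(rankings, gold, 3)
print(acc_at_1, acc_at_3)
```

Reporting the metric at several values of k shows whether an advantage holds both for the single best suggestion and for a short list that a human reviewer would inspect.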

DISCUSSION

The proposed deep learning-based mapping approach outperforms previous simple word-level matching algorithms because it can account for contextual and semantic information. Additionally, we demonstrate that the manner in which negative training samples are selected significantly affects the overall performance of the system.
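The abstract does not say how negative samples are selected, so the sketch below assumes the simplest strategy, uniform random sampling, purely to show what "multiple negative samples per positive sample" looks like as training data. The function `build_training_pairs`, its parameters, and all codes are hypothetical.

```python
import random


def build_training_pairs(positives, target_codes, negatives_per_positive=3, seed=0):
    """Build (source, target, label) examples: each known mapping is a
    positive (label 1), paired with several randomly drawn incorrect
    targets as negatives (label 0). Random selection is an assumption."""
    rng = random.Random(seed)
    examples = []
    for source, gold in positives:
        examples.append((source, gold, 1))
        # Candidate negatives are all target codes except the correct one.
        candidates = [code for code in target_codes if code != gold]
        n = min(negatives_per_positive, len(candidates))
        for negative in rng.sample(candidates, n):
            examples.append((source, negative, 0))
    return examples


# Hypothetical known mappings and target vocabulary.
positives = [("S1", "T1"), ("S2", "T2")]
target_codes = ["T1", "T2", "T3", "T4", "T5"]

examples = build_training_pairs(positives, target_codes, negatives_per_positive=3)
for example in examples:
    print(example)
```

Swapping the `rng.sample` call for a different selection rule, for example preferring negatives whose descriptions resemble the source, is the kind of design choice the discussion reports as affecting overall performance.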

CONCLUSION

Incorporating the semantics of code descriptions increases matching accuracy significantly more than traditional text co-occurrence-based approaches do. The methodology for collecting negative training samples is also an important component of the proposed trainable system and can be adopted in both present and future related systems.
