Adams Meredith C B, Perkins Matthew L, Hudson Cody, Madhira Vithal, Akbilgic Oguz, Ma Da, Hurley Robert W, Topaloglu Umit
Department of Anesthesiology, Artificial Intelligence, Translational Neuroscience, and Public Health Sciences, Wake Forest University School of Medicine, Winston-Salem, NC, United States.
Department of Cancer Biology, Wake Forest University School of Medicine, Winston-Salem, NC, United States.
J Med Internet Res. 2025 May 15;27:e69004. doi: 10.2196/69004.
The integration of diverse clinical data sources requires standardization through models such as the Observational Medical Outcomes Partnership (OMOP) Common Data Model. However, mapping data elements to OMOP concepts demands significant technical expertise and time. While large health care systems often have resources for OMOP conversion, smaller clinical trials and studies frequently lack such support, leaving valuable research data siloed.
This study aims to develop and validate a user-friendly tool that leverages large language models to automate the OMOP conversion process for clinical trials, electronic health records, and registry data.
We developed a 3-tiered semantic matching system using GPT-3 embeddings to transform heterogeneous clinical data to the OMOP Common Data Model. The system processes input terms by generating vector embeddings, computing cosine similarity against precomputed Observational Health Data Sciences and Informatics vocabulary embeddings, and ranking potential matches. We validated the system using two independent datasets: (1) a development set of 76 National Institutes of Health Helping to End Addiction Long-term Initiative clinical trial common data elements for chronic pain and opioid use disorders and (2) a separate validation set of electronic health record concepts from the National Institutes of Health National COVID Cohort Collaborative COVID-19 enclave. The architecture combines Unified Medical Language System semantic frameworks with asynchronous processing for efficient concept mapping, made available through an open-source implementation.
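The matching step described above (embed the input term, compute cosine similarity against precomputed vocabulary embeddings, rank the candidates) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 3-dimensional toy vectors, and the concept labels are all hypothetical, and in a real pipeline the vectors would come from an embedding model rather than being typed in by hand.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_matches(query_vec, vocab, top_k=3):
    """Return the top_k (label, similarity) pairs, highest similarity first.

    `vocab` maps a concept label to its precomputed embedding.
    """
    scored = [(label, cosine_similarity(query_vec, vec))
              for label, vec in vocab.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy stand-ins for precomputed vocabulary embeddings (hypothetical values).
vocab = {
    "Gender": [1.0, 0.0, 0.0],
    "Pain severity": [0.0, 1.0, 0.2],
    "Opioid use": [0.1, 0.3, 1.0],
}

# Toy stand-in for the embedding of an input term (hypothetical values).
query = [0.9, 0.1, 0.0]

ranked = rank_matches(query, vocab)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

Precomputing the vocabulary embeddings once, as the abstract describes, means only the input term needs to be embedded at query time; the ranking itself is then a simple similarity scan.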
The system achieved an area under the receiver operating characteristic curve of 0.9975 for mapping clinical trial common data element terms. Precision ranged from 0.92 to 0.99 and recall ranged from 0.88 to 0.97 across similarity thresholds from 0.85 to 1.0. In practical application, the tool successfully automated mappings that previously required manual informatics expertise, reducing the technical barriers for research teams to participate in large-scale, data-sharing initiatives. Representative mappings demonstrated high accuracy, such as demographic terms achieving 100% similarity with corresponding Logical Observation Identifiers Names and Codes concepts. The implementation successfully processes diverse data types through both individual term mapping and batch processing capabilities.
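To make the threshold-dependent precision and recall figures concrete, the following sketch shows one common way such values are computed: a candidate mapping is accepted when its similarity score meets the threshold, and accepted mappings are compared against reference labels. The abstract does not specify how the authors defined positives, so this definition is an assumption, and the scores and labels below are invented for illustration only.

```python
def precision_recall(scores, is_correct, threshold):
    """Precision and recall when mappings with score >= threshold are accepted.

    `scores` are similarity scores; `is_correct` marks whether each
    candidate mapping matches the reference standard.
    """
    tp = sum(1 for s, ok in zip(scores, is_correct) if s >= threshold and ok)
    fp = sum(1 for s, ok in zip(scores, is_correct) if s >= threshold and not ok)
    fn = sum(1 for s, ok in zip(scores, is_correct) if s < threshold and ok)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Invented toy data, not results from the study.
scores = [0.99, 0.95, 0.90, 0.87, 0.80]
is_correct = [True, True, False, True, True]

for threshold in (0.85, 0.95):
    p, r = precision_recall(scores, is_correct, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold typically trades recall for precision, which is consistent with the ranges of values reported across thresholds from 0.85 to 1.0.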
Our validated large language model-based tool effectively automates the transformation of clinical data into the OMOP format while maintaining high accuracy. The combination of semantic matching capabilities and a researcher-friendly interface makes data harmonization accessible to smaller research teams without requiring extensive informatics support. This has direct implications for accelerating clinical research data standardization and enabling broader participation in initiatives such as the National Institutes of Health Helping to End Addiction Long-term Initiative Data Ecosystem.