CDEMapper: enhancing National Institutes of Health common data element use with large language models.

Author information

Wang Yan, Huang Jimin, He Huan, Zhang Vincent, Zhou Yujia, Hao Xubing, Ram Pritham, Qian Lingfei, Xie Qianqian, Weng Ruey-Ling, Lin Fongci, Hu Yan, Cui Licong, Jiang Xiaoqian, Xu Hua, Hong Na

Affiliations

Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT 06510, United States.

McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, United States.

Publication information

J Am Med Inform Assoc. 2025 Jul 1;32(7):1130-1139. doi: 10.1093/jamia/ocaf064.

DOI:10.1093/jamia/ocaf064

PMID:40332956

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12202029/

Abstract

OBJECTIVE

Common Data Elements (CDEs) standardize data collection and sharing across studies, enhancing data interoperability and improving research reproducibility. However, implementing CDEs presents challenges due to the broad range and variety of data elements. This study aims to develop a CDE mapping tool to bridge the gap between local data elements and National Institutes of Health (NIH) CDEs.

METHODS

We propose CDEMapper, a large language model (LLM)-powered mapping tool designed to assist in mapping local data elements to NIH CDEs. CDEMapper has 3 core modules: (1) CDE indexing and embeddings. NIH CDEs were indexed and embedded to support semantic search; (2) CDE recommendations. The tool combines Elasticsearch (BM25 methods) with GPT services to recommend candidate CDEs and their permissible values; and (3) Human review. Users review and select the best match for their data elements and value sets. We evaluate the tool's recommendation accuracy and usability against manual annotations and testing.
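The recommendation step described above (lexical BM25 retrieval combined with embedding similarity) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the BM25 scorer is a small pure-Python version rather than Elasticsearch, the `embed` function uses character trigram counts as a stand-in for GPT embeddings, the GPT ranker is omitted, the `alpha` weighting is hypothetical, and the CDE names in the example are invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real analyzers do more)."""
    return text.lower().split()


class BM25:
    """Small in-memory BM25 scorer standing in for an Elasticsearch index."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        # Document frequency: number of documents containing each term.
        df = Counter(t for d in self.docs for t in set(d))
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        doc = self.docs[i]
        tf = Counter(doc)
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            num = tf[term] * (self.k1 + 1)
            den = tf[term] + self.k1 * (1 - self.b + self.b * len(doc) / self.avgdl)
            total += self.idf[term] * num / den
        return total


def embed(text):
    """Stand-in embedding: character trigram counts (NOT a GPT embedding)."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def recommend(query, cdes, top_k=3, alpha=0.5):
    """Rank candidate CDEs by a weighted mix of lexical and semantic scores."""
    bm25 = BM25(cdes)
    lexical = [bm25.score(query, i) for i in range(len(cdes))]
    max_lex = max(lexical) or 1.0  # avoid dividing by zero when nothing matches
    q_vec = embed(query)
    scored = []
    for i, cde in enumerate(cdes):
        semantic = cosine(q_vec, embed(cde))
        combined = alpha * lexical[i] / max_lex + (1 - alpha) * semantic
        scored.append((combined, cde))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cde for _, cde in scored[:top_k]]


# Hypothetical CDE names, for illustration only.
cdes = [
    "Patient birth date",
    "Systolic blood pressure measurement",
    "Body weight measurement",
    "Smoking status",
]
print(recommend("blood pressure systolic", cdes, top_k=2))
```

In a setup like this, the top-k candidates would then go to a reranking step and finally to a human reviewer, mirroring modules (2) and (3) of the tool.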

RESULTS

CDEMapper offers a publicly available, LLM-powered, and intuitive user interface that consolidates essential and advanced mapping services into a streamlined pipeline. The evaluation results demonstrated that the augmented BM25 with GPT embeddings and a GPT ranker achieved the overall best performance. The usability test also highlighted the effectiveness and efficiency of our tool.

DISCUSSIONS AND CONCLUSIONS

This work opens up the potential of using LLMs to assist with CDE mapping when aligning local data elements with NIH CDEs. Additionally, this effort helps researchers better understand the gaps between their data elements and NIH CDEs while promoting CDE reusability.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e61c/12202029/d58fb31c377d/ocaf064f1.jpg

Similar articles

Detecting Redundant Health Survey Questions by Using Language-Agnostic Bidirectional Encoder Representations From Transformers Sentence Embedding: Algorithm Development Study.

JMIR Med Inform. 2025 Jun 10;13:e71687. doi: 10.2196/71687.

Standardizing imaging findings representation: harnessing Common Data Elements semantics and Fast Healthcare Interoperability Resources structures.

J Am Med Inform Assoc. 2024 Aug 1;31(8):1735-1742. doi: 10.1093/jamia/ocae134.

Survivor, family and professional experiences of psychosocial interventions for sexual abuse and violence: a qualitative evidence synthesis.

Cochrane Database Syst Rev. 2022 Oct 4;10(10):CD013648. doi: 10.1002/14651858.CD013648.pub2.

What is the value of routinely testing full blood count, electrolytes and urea, and pulmonary function tests before elective surgery in patients with no apparent clinical indication and in subgroups of patients with common comorbidities: a systematic review of the clinical and cost-effective literature.

Health Technol Assess. 2012 Dec;16(50):i-xvi, 1-159. doi: 10.3310/hta16500.

Health professionals' experience of teamwork education in acute hospital settings: a systematic review of qualitative literature.

JBI Database System Rev Implement Rep. 2016 Apr;14(4):96-137. doi: 10.11124/JBISRIR-2016-1843.

Evaluating and Improving Syndrome Differentiation Thinking Ability in Large Language Models: Method Development Study.

JMIR Med Inform. 2025 Jun 20;13:e75103. doi: 10.2196/75103.

Large Language Model-Assisted Risk-of-Bias Assessment in Randomized Controlled Trials Using the Revised Risk-of-Bias Tool: Usability Study.

J Med Internet Res. 2025 Jun 24;27:e70450. doi: 10.2196/70450.

Enhancing the Readability of Online Patient Education Materials Using Large Language Models: Cross-Sectional Study.

J Med Internet Res. 2025 Jun 4;27:e69955. doi: 10.2196/69955.

Enhancing Pulmonary Disease Prediction Using Large Language Models With Feature Summarization and Hybrid Retrieval-Augmented Generation: Multicenter Methodological Study Based on Radiology Report.

J Med Internet Res. 2025 Jun 11;27:e72638. doi: 10.2196/72638.
