Wang Yan, Huang Jimin, He Huan, Zhang Vincent, Zhou Yujia, Hao Xubing, Ram Pritham, Qian Lingfei, Xie Qianqian, Weng Ruey-Ling, Lin Fongci, Hu Yan, Cui Licong, Jiang Xiaoqian, Xu Hua, Hong Na
Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT 06510, United States.
McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, United States.
J Am Med Inform Assoc. 2025 Jul 1;32(7):1130-1139. doi: 10.1093/jamia/ocaf064.
Common Data Elements (CDEs) standardize data collection and sharing across studies, enhancing data interoperability and improving research reproducibility. However, implementing CDEs presents challenges due to the broad range and variety of data elements. This study aims to develop a CDE mapping tool to bridge the gap between local data elements and National Institutes of Health (NIH) CDEs.
We propose CDEMapper, a large language model (LLM)-powered tool designed to assist in mapping local data elements to NIH CDEs. CDEMapper has 3 core modules: (1) CDE indexing and embedding, in which NIH CDEs are indexed and embedded to support semantic search; (2) CDE recommendation, in which the tool combines Elasticsearch (BM25 ranking) with GPT services to recommend candidate CDEs and their permissible values; and (3) human review, in which users review and select the best match for their data elements and value sets. We evaluated the tool's recommendation accuracy against manual annotations and assessed its usability through user testing.
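The recommendation step described above pairs lexical (BM25) retrieval with embedding-based semantic similarity. The following is only a minimal sketch of that general idea, not CDEMapper's implementation: the catalogue `CDES`, the helper names, the character-trigram `toy_embed` (a stand-in for GPT embeddings), and the weighted-sum fusion controlled by `alpha` are all assumptions made for illustration. The GPT ranker and the Elasticsearch index are omitted entirely.

```python
import math
from collections import Counter

# Hypothetical toy catalogue; real NIH CDEs carry far richer metadata.
CDES = [
    "Birth date of the participant",
    "Sex assigned at birth",
    "Systolic blood pressure measurement",
    "Body mass index value",
]


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return the Okapi BM25 score of each document for the query."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def toy_embed(text):
    """Stand-in for an LLM embedding: sparse character-trigram counts."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(query, docs, top_k=3, alpha=0.5):
    """Rank docs by a weighted sum of normalized BM25 and cosine similarity."""
    lexical = bm25_scores(query, docs)
    max_lex = max(lexical) or 1.0  # avoid division by zero when nothing matches
    query_vec = toy_embed(query)
    scored = []
    for doc, lex in zip(docs, lexical):
        semantic = cosine(query_vec, toy_embed(doc))
        scored.append((alpha * lex / max_lex + (1 - alpha) * semantic, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


if __name__ == "__main__":
    # A local data element label is matched against the toy catalogue.
    for candidate in recommend("date of birth", CDES):
        print(candidate)
```

In a workflow like the one described, the ranked candidates would then go to a human reviewer, who selects the best match rather than accepting the top result automatically.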
CDEMapper offers a publicly available, LLM-powered, and intuitive user interface that consolidates essential and advanced mapping services into a streamlined pipeline. In the evaluation, BM25 augmented with GPT embeddings and a GPT ranker achieved the best overall performance. The usability test also indicated that the tool is effective and efficient.
This work demonstrates the potential of LLMs to assist in aligning local data elements with NIH CDEs. It also helps researchers better understand the gaps between their data elements and NIH CDEs while promoting CDE reuse.