

CDEMapper: enhancing National Institutes of Health common data element use with large language models.

Author information

Wang Yan, Huang Jimin, He Huan, Zhang Vincent, Zhou Yujia, Hao Xubing, Ram Pritham, Qian Lingfei, Xie Qianqian, Weng Ruey-Ling, Lin Fongci, Hu Yan, Cui Licong, Jiang Xiaoqian, Xu Hua, Hong Na

Affiliations

Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT 06510, United States.

McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, United States.

Publication information

J Am Med Inform Assoc. 2025 Jul 1;32(7):1130-1139. doi: 10.1093/jamia/ocaf064.

Abstract

OBJECTIVE

Common Data Elements (CDEs) standardize data collection and sharing across studies, enhancing data interoperability and improving research reproducibility. However, implementing CDEs presents challenges due to the broad range and variety of data elements. This study aims to develop a CDE mapping tool to bridge the gap between local data elements and National Institutes of Health (NIH) CDEs.

METHODS

We propose CDEMapper, a large language model (LLM)-powered tool that assists in mapping local data elements to NIH CDEs. CDEMapper has 3 core modules: (1) CDE indexing and embeddings: NIH CDEs are indexed and embedded to support semantic search; (2) CDE recommendations: the tool combines Elasticsearch (BM25 methods) with GPT services to recommend candidate CDEs and their permissible values; and (3) human review: users review the candidates and select the best match for their data elements and value sets. We evaluated the tool's recommendation accuracy against manual annotations and its usability through usability testing.
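The retrieve-then-re-rank idea behind the recommendation module can be illustrated with a minimal sketch. This is not CDEMapper's implementation: the catalog entries and identifiers are hypothetical, BM25 is computed in plain Python rather than by Elasticsearch, and a character-trigram vector stands in for GPT embeddings. The GPT ranker and the permissible-value matching are omitted.

```python
import math
from collections import Counter

# Hypothetical mini-catalog; illustrative entries, not real NIH CDE records.
CDES = {
    "cde_age": "Age of the participant in years at enrollment",
    "cde_sex": "Sex of the participant assigned at birth",
    "cde_sbp": "Systolic blood pressure measurement in mmHg",
    "cde_dbp": "Diastolic blood pressure measurement in mmHg",
}


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real analyzers do more)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    doc_freq = Counter(term for toks in tokenized.values() for term in set(toks))
    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            # Non-negative IDF variant (the "+1 inside the log" form).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


def embed(text):
    """Stand-in embedding: character-trigram counts instead of a model call."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recommend(query, docs, top_k=3):
    """Retrieve candidates with BM25, then re-rank them by embedding similarity."""
    lexical = bm25_scores(query, docs)
    matched = [doc_id for doc_id in lexical if lexical[doc_id] > 0]
    candidates = sorted(matched, key=lexical.get, reverse=True)[:top_k]
    query_vec = embed(query)
    return sorted(candidates, key=lambda d: cosine(query_vec, embed(docs[d])), reverse=True)


print(recommend("systolic blood pressure", CDES))
```

In this sketch the lexical stage narrows the catalog to a short candidate list and the similarity stage reorders it. The final choice would still rest with the human reviewer described in module (3).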

RESULTS

CDEMapper is a publicly available, LLM-powered tool with an intuitive user interface that consolidates essential and advanced mapping services into a streamlined pipeline. In the evaluation, BM25 augmented with GPT embeddings and a GPT ranker achieved the best overall performance. The usability test also highlighted the effectiveness and efficiency of the tool.
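As an illustration of how recommendation accuracy can be scored against manual annotations, the sketch below computes top-k accuracy. The abstract does not name the metric used in the study, so the choice of top-k accuracy is an assumption, and the element names and rankings are hypothetical.

```python
def top_k_accuracy(recommendations, gold, k):
    """Fraction of local elements whose annotated CDE appears in the top-k list."""
    hits = sum(
        1 for element, ranked in recommendations.items()
        if gold[element] in ranked[:k]
    )
    return hits / len(recommendations)


# Hypothetical ranked recommendations per local data element.
recommendations = {
    "pt_age": ["cde_age", "cde_dob"],
    "bp_sys": ["cde_dbp", "cde_sbp"],
    "gender": ["cde_race", "cde_eth"],
}
# Hypothetical manual annotations (one gold CDE per local element).
gold = {"pt_age": "cde_age", "bp_sys": "cde_sbp", "gender": "cde_sex"}

for k in (1, 2):
    print(f"top-{k} accuracy: {top_k_accuracy(recommendations, gold, k):.2f}")
```

Reporting the score at several values of k shows whether the correct CDE is usually ranked first or merely present somewhere in the candidate list, which matters when a human reviewer makes the final selection.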

DISCUSSIONS AND CONCLUSIONS

This work opens up the potential of using LLMs to assist with CDE mapping when aligning local data elements with NIH CDEs. Additionally, this effort helps researchers better understand the gaps between their data elements and NIH CDEs while promoting CDE reusability.


