Automated Data Harmonization in Clinical Research: Natural Language Processing Approach.

Authors

Mallya Pratheek, Henao Ricardo, Hong Chuan, Wojdyla Daniel, Schibler Tony, Manchanda Vihaan, Pencina Michael, Hall Jennifer, Zhao Juan

Affiliations

American Heart Association, 7272 Greenville Ave, Dallas, TX, 75231, United States, 1 2147061164.

Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, United States.

Publication information

JMIR Form Res. 2025 Aug 27;9:e75608. doi: 10.2196/75608.

DOI:10.2196/75608
PMID:40874791
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12391522/

Abstract

BACKGROUND

Integrating data is essential for advancing clinical and epidemiological research. However, because datasets often describe variables (eg, demographic and health conditions) in diverse ways, the process of integrating and harmonizing variables from research studies remains a major bottleneck.

OBJECTIVE

The objective was to assess a natural language processing-based method to automate variable harmonization to achieve a scalable approach to integration of multiple datasets.

METHODS

We developed a fully connected neural network (FCN) method, enhanced with contrastive learning, using domain-specific embeddings from the Bidirectional Encoder Representations from Transformers for Biomedical Text Mining language representation model, using 3 cardiovascular datasets: the Atherosclerosis Risk in Communities study, the Framingham Heart Study, and the Multi-Ethnic Study of Atherosclerosis. We used metadata variable descriptions and curated harmonized concepts as ground truth. We framed the problem as a paired sentence classification task. The accuracy of this method was compared with a logistic regression baseline method. To assess the generalizability of the trained models, we also evaluated their performance by separating the 3 datasets when preparing the training and validation sets.
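The paired sentence classification framing can be illustrated with a minimal sketch. It assumes that every variable description is paired with every candidate harmonized concept and labeled 1 only for the curated match; the abstract does not state how negative pairs were actually built, and the function name `build_pairs` and the example descriptions are hypothetical.

```python
from itertools import product


def build_pairs(variables, concepts):
    """Pair each variable description with each candidate concept.

    variables: list of (description, curated_concept) tuples.
    concepts:  list of candidate harmonized concept names.
    Returns (description, concept, label) triples, where label is 1
    when the concept is the curated match and 0 otherwise.
    Assumption: exhaustive pairing; the source does not specify this.
    """
    pairs = []
    for (description, true_concept), concept in product(variables, concepts):
        pairs.append((description, concept, int(concept == true_concept)))
    return pairs


# Hypothetical metadata, not taken from ARIC, FHS, or MESA.
variables = [
    ("Age at baseline exam (years)", "age"),
    ("Systolic blood pressure, seated, mm Hg", "systolic blood pressure"),
]
concepts = ["age", "systolic blood pressure", "smoking status"]

pairs = build_pairs(variables, concepts)
for description, concept, label in pairs:
    print(label, "|", description, "|", concept)
```

In a setup like the one described, each pair would then be embedded and passed to a binary classifier; that step is omitted here because it depends on model details the abstract does not give.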

RESULTS

The newly developed FCN achieved a top-5 accuracy of 98.95% (95% CI 98.31%-99.47%) and an area under the receiver operating characteristic curve (AUC) of 0.99 (95% CI 0.98-0.99), outperforming the standard logistic regression model, which exhibited a top-5 accuracy of 22.23% (95% CI 19.91%-24.87%) and an AUC of 0.82 (95% CI 0.81-0.83). The contrastive learning enhancement also outperformed the logistic regression model, although slightly below the base FCN model, exhibiting a top-5 accuracy of 89.88% (95% CI 87.88%-91.68%) and an AUC of 0.98 (95% CI 0.97-0.98).
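As a rough illustration of the top-5 accuracy metric reported above, the sketch below counts a variable as correct when its curated concept appears among the k highest-scoring candidates. The function name `top_k_accuracy`, the score dictionaries, and the concept names are all hypothetical; the abstract does not describe the authors' evaluation code.

```python
def top_k_accuracy(scores, true_concepts, k=5):
    """Fraction of variables whose true concept ranks in the top k.

    scores:        list of dicts mapping candidate concept -> model score,
                   one dict per variable.
    true_concepts: list of curated concepts, aligned with scores.
    """
    hits = 0
    for concept_scores, truth in zip(scores, true_concepts):
        # Rank candidate concepts from highest to lowest score.
        ranked = sorted(concept_scores, key=concept_scores.get, reverse=True)
        hits += truth in ranked[:k]
    return hits / len(true_concepts)


# Toy scores for three variables over four candidate concepts.
scores = [
    {"age": 0.90, "sex": 0.05, "sbp": 0.03, "bmi": 0.02},
    {"age": 0.10, "sex": 0.20, "sbp": 0.30, "bmi": 0.40},
    {"age": 0.40, "sex": 0.30, "sbp": 0.20, "bmi": 0.10},
]
true_concepts = ["age", "sbp", "bmi"]

print(top_k_accuracy(scores, true_concepts, k=1))
print(top_k_accuracy(scores, true_concepts, k=2))
```

With only four candidates the toy example uses small k; with a realistic concept vocabulary the same function applies with k=5.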

CONCLUSIONS

This novel approach provides a scalable solution for harmonizing metadata across large-scale cohort studies. The proposed method significantly enhances the performance over the baseline method by using learned representations to categorize harmonized concepts more accurately for cohorts in cardiovascular disease and stroke.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/33d4/12391522/09ff10762b4c/formative-v9-e75608-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/33d4/12391522/45a99ba62409/formative-v9-e75608-g001.jpg
