Automated Data Harmonization in Clinical Research: Natural Language Processing Approach.

Authors

Mallya Pratheek, Henao Ricardo, Hong Chuan, Wojdyla Daniel, Schibler Tony, Manchanda Vihaan, Pencina Michael, Hall Jennifer, Zhao Juan

Affiliations

American Heart Association, 7272 Greenville Ave, Dallas, TX, 75231, United States, 1 2147061164.

Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, United States.

Publication information

JMIR Form Res. 2025 Aug 27;9:e75608. doi: 10.2196/75608.

Abstract

BACKGROUND

Integrating data is essential for advancing clinical and epidemiological research. However, because datasets often describe variables (eg, demographics and health conditions) in diverse ways, integrating and harmonizing variables across research studies remains a major bottleneck.

OBJECTIVE

The objective was to assess a natural language processing-based method for automating variable harmonization, as a scalable approach to integrating multiple datasets.

METHODS

We developed a fully connected neural network (FCN) method, enhanced with contrastive learning, that uses domain-specific embeddings from the Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) language representation model. The method was developed using 3 cardiovascular datasets: the Atherosclerosis Risk in Communities study, the Framingham Heart Study, and the Multi-Ethnic Study of Atherosclerosis. We used metadata variable descriptions as input and curated harmonized concepts as ground truth, and we framed the problem as a paired sentence classification task. The accuracy of this method was compared with that of a logistic regression baseline. To assess the generalizability of the trained models, we also evaluated their performance with the 3 datasets kept separate when preparing the training and validation sets.
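The paired sentence framing and the cohort-separated evaluation can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `build_pairs` and `split_by_cohort`, the record layout, and the example descriptions are all hypothetical, and holding out one whole cohort is only one assumed way of separating the datasets. The embedding and FCN steps are left out.

```python
from itertools import product


def build_pairs(variables, concepts):
    """Pair every variable description with every candidate concept.

    variables: list of (description, true_concept) tuples.
    concepts:  list of candidate harmonized concept names.
    Returns (description, concept, label) triples; label is 1 for the true match.
    """
    return [
        (description, concept, int(concept == true_concept))
        for (description, true_concept), concept in product(variables, concepts)
    ]


def split_by_cohort(records, held_out):
    """Hold one cohort out entirely for validation (an assumed protocol)."""
    train = [r for r in records if r["cohort"] != held_out]
    valid = [r for r in records if r["cohort"] == held_out]
    return train, valid


# Hypothetical metadata rows, invented for illustration only.
records = [
    {"cohort": "ARIC", "description": "Age at baseline exam", "concept": "age"},
    {"cohort": "FHS", "description": "Systolic blood pressure, mmHg",
     "concept": "systolic blood pressure"},
    {"cohort": "MESA", "description": "Body mass index (kg/m2)",
     "concept": "body mass index"},
]
concepts = sorted({r["concept"] for r in records})

train, valid = split_by_cohort(records, held_out="MESA")
pairs = build_pairs([(r["description"], r["concept"]) for r in train], concepts)
```

Each triple in `pairs` would then be embedded and passed to a binary classifier; a description's predicted concept ranking comes from sorting its pairs by classifier score.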

RESULTS

The newly developed FCN achieved a top-5 accuracy of 98.95% (95% CI 98.31%-99.47%) and an area under the receiver operating characteristic curve (AUC) of 0.99 (95% CI 0.98-0.99), outperforming the standard logistic regression model, which had a top-5 accuracy of 22.23% (95% CI 19.91%-24.87%) and an AUC of 0.82 (95% CI 0.81-0.83). The contrastive learning enhancement also outperformed the logistic regression model, although it fell below the base FCN model, with a top-5 accuracy of 89.88% (95% CI 87.88%-91.68%) and an AUC of 0.98 (95% CI 0.97-0.98).
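For readers unfamiliar with the metric, top-5 accuracy counts a variable as correct when its true harmonized concept appears among the 5 highest-scoring candidates. The sketch below shows one way to compute it, assuming each variable comes with a dictionary of per-concept scores; the function name `top_k_accuracy` and the toy scores are hypothetical and are not taken from the study.

```python
def top_k_accuracy(scores_per_variable, true_concepts, k=5):
    """Fraction of variables whose true concept is among the k best-scored.

    scores_per_variable: list of dicts mapping concept name -> model score.
    true_concepts:       list of ground-truth concept names, same order.
    """
    hits = 0
    for scores, truth in zip(scores_per_variable, true_concepts):
        # Rank candidate concepts from highest to lowest score, keep the top k.
        ranked = sorted(scores, key=scores.get, reverse=True)[:k]
        hits += truth in ranked
    return hits / len(true_concepts)


# Toy scores for two variables over three candidate concepts (invented).
scores = [
    {"age": 0.9, "sex": 0.2, "body mass index": 0.1},
    {"age": 0.6, "sex": 0.5, "body mass index": 0.4},
]
truth = ["age", "body mass index"]

top1 = top_k_accuracy(scores, truth, k=1)
top3 = top_k_accuracy(scores, truth, k=3)
```

With these toy scores the second variable's true concept is ranked last, so it counts as a hit only once `k` covers all three candidates.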

CONCLUSIONS

This novel approach provides a scalable solution for harmonizing metadata across large-scale cohort studies. The proposed method significantly improves performance over the baseline method by using learned representations to categorize harmonized concepts more accurately for cardiovascular disease and stroke cohorts.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/33d4/12391522/45a99ba62409/formative-v9-e75608-g001.jpg
