Robust Automated Harmonization of Heterogeneous Data Through Ensemble Machine Learning: Algorithm Development and Validation Study.

Authors

Yang Doris, Zhou Doudou, Cai Steven, Gan Ziming, Pencina Michael, Avillach Paul, Cai Tianxi, Hong Chuan

Affiliations

Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States.

Department of Statistics and Data Science, National University of Singapore, Singapore, Singapore.

Publication information

JMIR Med Inform. 2025 Jan 22;13:e54133. doi: 10.2196/54133.

DOI: 10.2196/54133
PMID: 39844378
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11778729/
Abstract

BACKGROUND

Cohort studies contain rich clinical data across large and diverse patient populations and are a common source of observational data for clinical research. Because large-scale cohort studies are both time- and resource-intensive, one alternative is to harmonize data from existing cohorts through multicohort studies. However, because cohorts encode variables differently, accurate variable harmonization is difficult.

OBJECTIVE

We propose SONAR (Semantic and Distribution-Based Harmonization) as a method for harmonizing variables across cohort studies to facilitate multicohort studies.

METHODS

SONAR used semantic learning from variable descriptions and distribution learning from study participant data. Our method learned an embedding vector for each variable and used pairwise cosine similarity to score the similarity between variables. This approach was built on 3 National Institutes of Health cohorts: the Cardiovascular Health Study, the Multi-Ethnic Study of Atherosclerosis, and the Women's Health Initiative. We also used gold standard labels to further refine the embeddings in a supervised manner.
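
The scoring step described above (one embedding vector per variable, compared by pairwise cosine similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the variable names, the three-dimensional toy vectors, and the helper functions `cosine_similarity` and `rank_candidates` are all hypothetical, and the sketch does not cover how SONAR learns its embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(query_name, embeddings):
    """Rank every other variable by similarity to the query, best first."""
    query = embeddings[query_name]
    scores = [
        (name, cosine_similarity(query, vec))
        for name, vec in embeddings.items()
        if name != query_name
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; real embeddings would be learned from
# variable descriptions and participant-level distributions.
embeddings = {
    "chs_sbp": [0.9, 0.1, 0.0],
    "mesa_sysbp": [0.8, 0.2, 0.1],
    "whi_smoker": [0.0, 0.1, 0.9],
}

ranking = rank_candidates("chs_sbp", embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setup the two blood pressure variables point in nearly the same direction, so `mesa_sysbp` ranks above `whi_smoker` as a harmonization candidate for `chs_sbp`.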

RESULTS

The method was evaluated using manually curated gold standard labels from the 3 National Institutes of Health cohorts. We evaluated both the intracohort and intercohort variable harmonization performance. The supervised SONAR method outperformed existing benchmark methods for almost all intracohort and intercohort comparisons using area under the curve and top-k accuracy metrics. Notably, SONAR was able to significantly improve harmonization of concepts that were difficult for existing semantic methods to harmonize.
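
The two metrics named above can be sketched as follows. This is an illustrative sketch under simplifying assumptions, not the evaluation code used in the study: it assumes each query variable has exactly one gold standard match, it computes the area under the ROC curve through its pairwise-comparison interpretation, and the function names and toy data are hypothetical.

```python
def top_k_accuracy(rankings, gold, k):
    """Fraction of query variables whose gold match is among the top k candidates."""
    hits = sum(1 for query, ranked in rankings.items() if gold[query] in ranked[:k])
    return hits / len(rankings)


def auc(scores, labels):
    """Area under the ROC curve: the probability that a randomly chosen
    matching pair scores higher than a randomly chosen non-matching pair,
    with ties counted as one half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative pair")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical ranked candidate lists for three query variables,
# each with the single gold match "x".
rankings = {
    "a": ["x", "y", "z"],
    "b": ["y", "x", "z"],
    "c": ["z", "y", "x"],
}
gold = {"a": "x", "b": "x", "c": "x"}

# Hypothetical similarity scores for four variable pairs; label 1 marks a gold match.
pair_scores = [0.9, 0.8, 0.3, 0.2]
pair_labels = [1, 0, 1, 0]

print(top_k_accuracy(rankings, gold, k=1))
print(top_k_accuracy(rankings, gold, k=2))
print(auc(pair_scores, pair_labels))
```

Top-k accuracy rewards a method for placing the correct match anywhere in a short candidate list, which suits a workflow where a curator reviews the top suggestions, while the area under the curve summarizes how well the similarity score separates matching from non-matching pairs across all thresholds.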

CONCLUSIONS

SONAR achieves accurate variable harmonization within and between cohort studies by harnessing the complementary strengths of semantic learning and variable distribution learning.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/62be/11778729/2527b5e8351a/medinform-v13-e54133-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/62be/11778729/c5c00124de6c/medinform-v13-e54133-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/62be/11778729/110bb3d62ff3/medinform-v13-e54133-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/62be/11778729/27ef0d5c8b96/medinform-v13-e54133-g003.jpg
