Semantic search helper: A tool based on the use of embeddings in multi-item questionnaires as a harmonization opportunity for merging large datasets - A feasibility study.

Authors

Gottfried Karl, Janson Karina, Holz Nathalie E, Reis Olaf, Kornhuber Johannes, Eichler Anna, Banaschewski Tobias, Nees Frauke

Affiliations

Institute of Applied Medical Informatics, University Hospital Center Hamburg-Eppendorf, Hamburg, Germany.

Department of Child and Adolescent Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Baden-Württemberg, Germany.

Publication information

Eur Psychiatry. 2025 Jan 20;68(1):e8. doi: 10.1192/j.eurpsy.2024.1808.

DOI:10.1192/j.eurpsy.2024.1808

PMID:39831376

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11795448/

Abstract

BACKGROUND

Recent advances in natural language processing (NLP) have opened new avenues in semantic data analysis. A promising application of NLP is data harmonization in questionnaire-based cohort studies, where it can serve as an additional method, specifically when a construct has been measured only with different instruments, and for evaluating potentially new construct constellations. The present article therefore explores the potential of embedding models to detect opportunities for semantic harmonization.

METHODS

Using models such as SBERT and OpenAI's ADA, we developed a prototype application ("Semantic Search Helper") that facilitates harmonization by detecting semantically similar items within extensive health-related datasets. The feasibility and applicability of the approach were evaluated in a use case analysis of data from four large cohort studies, whose heterogeneous data had been collected with different sets of instruments for common constructs.
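The item-matching step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the similarity measure, so cosine similarity is assumed here, and the `embed` function is a toy bag-of-words stand-in for a real sentence-embedding model such as SBERT or ADA. The item texts, the identifiers, and the 0.5 threshold are all hypothetical.

```python
import math
import re
from collections import Counter
from itertools import product


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector.

    Assumption: a real pipeline would call a sentence-embedding model here.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def candidate_pairs(items_a, items_b, threshold=0.5):
    """Return cross-instrument pairs scoring >= threshold, best first."""
    vecs_a = {key: embed(text) for key, text in items_a.items()}
    vecs_b = {key: embed(text) for key, text in items_b.items()}
    scored = [
        (cosine(vecs_a[a], vecs_b[b]), a, b)
        for a, b in product(vecs_a, vecs_b)
    ]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)


# Hypothetical items from two instruments covering a similar construct.
instrument_a = {
    "A1": "I feel sad most of the time",
    "A2": "I have trouble falling asleep",
}
instrument_b = {
    "B1": "Most of the time I feel sad",
    "B2": "I enjoy meeting my friends",
}

for score, a, b in candidate_pairs(instrument_a, instrument_b):
    print(f"{a} <-> {b}: {score:.2f}")
```

Because the toy `embed` only counts shared words, it cannot recognize paraphrases that use different vocabulary; that gap is what a learned embedding model is meant to close. The ranking and thresholding logic would stay the same, with the surviving pairs handed to experts for review.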

RESULTS

With the prototype, we effectively identified potential harmonization pairs, which significantly reduced manual evaluation efforts. Expert ratings of semantic similarity candidates showed high agreement with model-generated pairs, confirming the validity of our approach.

CONCLUSIONS

This study demonstrates the potential of embeddings for matching semantically similar items as a promising add-on tool that can assist the harmonization of data sets collected with different instruments but with similar content, within and across studies.
