Semantic search helper: A tool based on the use of embeddings in multi-item questionnaires as a harmonization opportunity for merging large datasets - A feasibility study.

Author information

Gottfried Karl, Janson Karina, Holz Nathalie E, Reis Olaf, Kornhuber Johannes, Eichler Anna, Banaschewski Tobias, Nees Frauke

Affiliations

Institute of Applied Medical Informatics, University Hospital Center Hamburg-Eppendorf, Hamburg, Germany.

Department of Child and Adolescent Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Baden-Württemberg, Germany.

Publication information

Eur Psychiatry. 2025 Jan 20;68(1):e8. doi: 10.1192/j.eurpsy.2024.1808.

Abstract

BACKGROUND

Recent advances in natural language processing (NLP), particularly in language processing methods, have opened new avenues in semantic data analysis. A promising application of NLP is data harmonization in questionnaire-based cohort studies, where it can serve as an additional method, particularly when a construct has been assessed only with different instruments, and for evaluating potentially new constellations of constructs. The present article therefore explores the potential of embedding models to detect opportunities for semantic harmonization.

METHODS

Using models like SBERT and OpenAI's ADA, we developed a prototype application ("Semantic Search Helper") to facilitate harmonization by detecting semantically similar items within extensive health-related datasets. The feasibility and applicability of the approach were evaluated through a use-case analysis involving data from four large cohort studies, whose heterogeneous data had been obtained with different sets of instruments for common constructs.
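To illustrate the general idea of embedding-based item matching, the following is a minimal sketch and not the authors' implementation. The abstract does not state which similarity measure or cutoff the prototype uses; cosine similarity and the `threshold` value are assumptions here, as are the function names `cosine` and `candidate_pairs`. The questionnaire items and their 3-dimensional vectors are made up; in practice the vectors would come from an embedding model such as SBERT.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def candidate_pairs(items_a, items_b, threshold=0.8):
    """Return cross-instrument item pairs at or above the threshold, best first.

    items_a and items_b map item text to its embedding vector.
    The threshold is a hypothetical cutoff chosen for illustration.
    """
    pairs = []
    for (text_a, vec_a), (text_b, vec_b) in product(items_a.items(), items_b.items()):
        score = cosine(vec_a, vec_b)
        if score >= threshold:
            pairs.append((text_a, text_b, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy vectors, invented for this example; real model embeddings
# typically have hundreds of dimensions.
study_a = {
    "I feel sad most of the time": [0.9, 0.1, 0.0],
    "I have trouble falling asleep": [0.1, 0.9, 0.1],
}
study_b = {
    "I often feel down or unhappy": [0.8, 0.2, 0.1],
    "I enjoy physical exercise": [0.0, 0.1, 0.9],
}

for text_a, text_b, score in candidate_pairs(study_a, study_b):
    print(f"{score:.2f}  {text_a}  <->  {text_b}")
```

Pairs returned this way are only candidates: as in the study, they would still be reviewed by domain experts before any items are treated as harmonizable.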

RESULTS

With the prototype, we effectively identified potential harmonization pairs, which significantly reduced manual evaluation efforts. Expert ratings of semantic similarity candidates showed high agreement with model-generated pairs, confirming the validity of our approach.

CONCLUSIONS

This study demonstrates the potential of embeddings for matching semantically similar items, as a promising add-on tool to assist the harmonization of multiplex data sets and of instruments that differ but have similar content, within and across studies.
