Gottfried Karl, Janson Karina, Holz Nathalie E, Reis Olaf, Kornhuber Johannes, Eichler Anna, Banaschewski Tobias, Nees Frauke
Institute of Applied Medical Informatics, University Hospital Center Hamburg-Eppendorf, Hamburg, Germany.
Department of Child and Adolescent Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Baden-Württemberg, Germany.
Eur Psychiatry. 2025 Jan 20;68(1):e8. doi: 10.1192/j.eurpsy.2024.1808.
Recent advances in natural language processing (NLP), particularly in embedding-based language models, have opened new avenues in semantic data analysis. A promising application of NLP is data harmonization in questionnaire-based cohort studies, where it can serve as a complementary method, particularly when a construct has been measured with different instruments, and for evaluating potentially new construct constellations. The present article therefore explores the potential of embedding models to detect opportunities for semantic harmonization.
Using models such as SBERT and OpenAI's ADA, we developed a prototype application ("Semantic Search Helper") that supports harmonization by detecting semantically similar items within extensive health-related datasets. The feasibility and applicability of the approach were evaluated in a use case analysis involving data from four large cohort studies, whose heterogeneous data had been obtained with different sets of instruments for common constructs.
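The item-matching step described above can be illustrated with a minimal sketch. It is not the Semantic Search Helper itself: the function names, the example items, and the threshold are hypothetical, and `toy_embed` is a bag-of-words stand-in used only so that the example runs without external dependencies. In practice, `embed` would be replaced by a call to a sentence-embedding model such as SBERT or ADA.

```python
import math
from collections import Counter
from itertools import product


def toy_embed(text):
    """Stand-in embedding (assumption): word counts, NOT SBERT or ADA."""
    cleaned = text.lower().replace("?", "").replace(".", "")
    return Counter(cleaned.split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def candidate_pairs(items_a, items_b, embed=toy_embed, threshold=0.5):
    """Score every cross-study item pair and keep those at or above threshold.

    Returns (similarity, item_a, item_b) tuples, most similar first.
    """
    vecs_a = [(text, embed(text)) for text in items_a]
    vecs_b = [(text, embed(text)) for text in items_b]
    scored = [
        (cosine(va, vb), ta, tb)
        for (ta, va), (tb, vb) in product(vecs_a, vecs_b)
    ]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)


# Hypothetical questionnaire items from two studies.
study_a = ["I feel sad most of the time", "I have trouble falling asleep"]
study_b = ["Do you feel sad most days", "How often do you drink alcohol"]

pairs = candidate_pairs(study_a, study_b, threshold=0.4)
for score, item_a, item_b in pairs:
    print(f"{score:.2f}  {item_a}  <->  {item_b}")
```

Pairs above the threshold are only candidates; as in the study, they would still be passed to domain experts for rating. The threshold value here is arbitrary and would need to be tuned for a real embedding model.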
With the prototype, we effectively identified potential harmonization pairs, which substantially reduced the manual evaluation effort. Expert ratings of the semantic similarity candidates showed high agreement with the model-generated pairs, supporting the validity of our approach.
This study demonstrates the potential of embeddings for matching semantically similar items, as a promising add-on tool to assist the harmonization of multiple data sets and instruments that differ in form but share similar content, within and across studies.