A multi-dimensional semantic pseudo-relevance feedback framework for information retrieval.

Authors

Pan Min, Liu Yu, Chen Jinguang, Huang Ellen Anne, Huang Jimmy X

Affiliations

College of Computer and Information Engineering, Hubei Normal University, Huangshi, China.

School of Electronic Information, Huzhou College, Huzhou, China.

Publication information

Sci Rep. 2024 Dec 30;14(1):31806. doi: 10.1038/s41598-024-82871-0.

DOI:10.1038/s41598-024-82871-0

PMID:39738376

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11686017/

Abstract

Pre-trained models have garnered significant attention in the field of information retrieval, particularly for improving document ranking. Typically, an initial retrieval step using sparse methods such as BM25 is employed to obtain a set of pseudo-relevant documents, followed by re-ranking with a pre-trained model. However, the semantic information captured by pre-trained models from sentences or passages is usually only applied to document ranking, with limited use in query expansion. In fact, the semantic information within pseudo-relevant documents plays a critical role in selecting appropriate query expansion terms. Therefore, this paper proposes a novel approach that leverages pre-trained models to extract multi-dimensional semantic information from pseudo-relevant documents, offering more possibilities for query expansion. First, traditional sparse retrieval methods are used in the initial retrieval stage to ensure efficiency, and term-level weights are calculated based on statistical information. Then, the pre-trained model encodes both the query and the sentences and passages from the documents, extracting sentence-level and passage-level semantic similarities to the query. Finally, these semantic weights are combined with the term-level weights to generate an improved query for the second retrieval round. We conducted experiments on five TREC datasets and a medical dataset, showing improvements in official metrics such as MAP and P@10. The results demonstrate the effectiveness of utilizing multi-dimensional semantic information from pseudo-relevant documents to optimize query expansion. This study offers new insights into how the semantic information of pseudo-relevant documents can be effectively harnessed to enhance retrieval performance.
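The three-level weighting described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything here is an assumption: the linear combination, the hypothetical mixing parameters `alpha`, `beta`, and `gamma`, the frequency-based term weight, and in particular the `embed` function, which is a bag-of-words stand-in for a pre-trained encoder so that the example stays self-contained. The first-stage BM25 retrieval is not shown; `feedback_docs` is assumed to be its top-ranked output, already split into passages and sentences.

```python
import math
from collections import Counter


def tokenize(text):
    # Very rough tokenizer for illustration only.
    return text.lower().replace(".", " ").replace(",", " ").split()


def embed(text):
    # ASSUMPTION: stand-in for a pre-trained encoder (bag-of-words counts).
    return Counter(tokenize(text))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def expansion_weights(query, feedback_docs, alpha=0.5, beta=0.3, gamma=0.2):
    """Score candidate expansion terms from pseudo-relevant documents.

    feedback_docs: list of documents; each document is a list of passages;
    each passage is a list of sentence strings.
    alpha/beta/gamma are hypothetical mixing weights, not values from the paper.
    """
    q_vec = embed(query)
    q_terms = set(tokenize(query))
    term_w, sent_w, pass_w = Counter(), Counter(), Counter()
    for doc in feedback_docs:
        for passage in doc:
            # Passage-level similarity to the query.
            p_sim = cosine(q_vec, embed(" ".join(passage)))
            for sentence in passage:
                # Sentence-level similarity to the query.
                s_sim = cosine(q_vec, embed(sentence))
                for term in tokenize(sentence):
                    if term in q_terms:
                        continue  # only score new terms
                    term_w[term] += 1.0  # term-level statistical weight
                    sent_w[term] = max(sent_w[term], s_sim)
                    pass_w[term] = max(pass_w[term], p_sim)
    max_tf = max(term_w.values(), default=1.0)
    return {
        t: alpha * term_w[t] / max_tf + beta * sent_w[t] + gamma * pass_w[t]
        for t in term_w
    }


def expand_query(query, feedback_docs, k=3):
    # Append the k best-scoring candidate terms to the original query terms.
    scores = expansion_weights(query, feedback_docs)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return tokenize(query) + [t for t, _ in ranked[:k]]
```

In this sketch, a candidate term that appears in a sentence close to the query receives a higher score than an equally frequent term from an unrelated sentence, which is the intuition the abstract gives for adding sentence-level and passage-level signals to term statistics. The expanded query would then be submitted for the second retrieval round.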

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e94d/11686017/3b781d9916f0/41598_2024_82871_Fig1_HTML.jpg

Similar articles

Document Retrieval for Precision Medicine Using a Deep Learning Ensemble Method.

JMIR Med Inform. 2021 Jun 29;9(6):e28272. doi: 10.2196/28272.

Learning to Refine Expansion Terms for Biomedical Information Retrieval using Semantic Resources.

IEEE/ACM Trans Comput Biol Bioinform. 2018 Feb 2. doi: 10.1109/TCBB.2018.2801303.

On the query reformulation technique for effective MEDLINE document retrieval.

J Biomed Inform. 2010 Oct;43(5):686-93. doi: 10.1016/j.jbi.2010.04.005. Epub 2010 Apr 13.

Effective matching of patients to clinical trials using entity extraction and neural re-ranking.

J Biomed Inform. 2023 Aug;144:104444. doi: 10.1016/j.jbi.2023.104444. Epub 2023 Jul 13.

Evaluation of Term Ranking Algorithms for Pseudo-Relevance Feedback in MEDLINE Retrieval.

Healthc Inform Res. 2011 Jun;17(2):120-30. doi: 10.4258/hir.2011.17.2.120. Epub 2011 Jun 30.

An adaptive term proximity based rocchio's model for clinical decision support retrieval.

BMC Med Inform Decis Mak. 2019 Dec 12;19(Suppl 9):251. doi: 10.1186/s12911-019-0986-6.

Relevance Feedback Based Query Expansion Model Using Borda Count and Semantic Similarity Approach.

Comput Intell Neurosci. 2015;2015:568197. doi: 10.1155/2015/568197. Epub 2015 Dec 7.

Document/query expansion based on selecting significant concepts for context based retrieval of medical images.

J Biomed Inform. 2019 Jul;95:103210. doi: 10.1016/j.jbi.2019.103210. Epub 2019 May 17.

Semantic concept-enriched dependence model for medical information retrieval.

J Biomed Inform. 2014 Feb;47:18-27. doi: 10.1016/j.jbi.2013.08.013. Epub 2013 Sep 11.

References

Semantic Search for Large Scale Clinical Ontologies.

AMIA Annu Symp Proc. 2022 Feb 21;2021:910-919. eCollection 2021.

An adaptive term proximity based rocchio's model for clinical decision support retrieval.

BMC Med Inform Decis Mak. 2019 Dec 12;19(Suppl 9):251. doi: 10.1186/s12911-019-0986-6.
