A multi-dimensional semantic pseudo-relevance feedback framework for information retrieval.

Authors

Pan Min, Liu Yu, Chen Jinguang, Huang Ellen Anne, Huang Jimmy X

Affiliations

College of Computer and Information Engineering, Hubei Normal University, Huangshi, China.

School of Electronic Information, Huzhou College, Huzhou, China.

Publication information

Sci Rep. 2024 Dec 30;14(1):31806. doi: 10.1038/s41598-024-82871-0.

Abstract

Pre-trained models have garnered significant attention in the field of information retrieval, particularly for improving document ranking. Typically, an initial retrieval step using sparse methods such as BM25 is employed to obtain a set of pseudo-relevant documents, followed by re-ranking with a pre-trained model. However, the semantic information captured by pre-trained models from sentences or passages is usually only applied to document ranking, with limited use in query expansion. In fact, the semantic information within pseudo-relevant documents plays a critical role in selecting appropriate query expansion terms. Therefore, this paper proposes a novel approach that leverages pre-trained models to extract multi-dimensional semantic information from pseudo-relevant documents, offering more possibilities for query expansion. First, traditional sparse retrieval methods are used in the initial retrieval stage to ensure efficiency, and term-level weights are calculated based on statistical information. Then, the pre-trained model encodes both the query and the sentences and passages from the documents, extracting sentence-level and passage-level semantic similarities to the query. Finally, these semantic weights are combined with the term-level weights to generate an improved query for the second retrieval round. We conducted experiments on five TREC datasets and a medical dataset, showing improvements in official metrics such as MAP and P@10. The results demonstrate the effectiveness of utilizing multi-dimensional semantic information from pseudo-relevant documents to optimize query expansion. This study offers new insights into how the semantic information of pseudo-relevant documents can be effectively harnessed to enhance retrieval performance.
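The three-stage pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not give the weighting formulas, so the linear combination and the parameters `alpha` and `beta` are assumptions, as are all function names. The `embed` function is a bag-of-words stand-in for the pre-trained encoder, used only so the example runs without external libraries. The BM25 first-pass retrieval is assumed to have already produced `docs`, and each document is treated as a single passage.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "were", "also", "is", "are"}

def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def embed(text):
    """Stand-in encoder: bag-of-words counts. A real system would call a pre-trained model here."""
    return Counter(tokenize(text))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def normalize(weights):
    """Scale weights so the largest is 1.0."""
    top = max(weights.values(), default=0.0)
    return {t: w / top for t, w in weights.items()} if top > 0 else dict(weights)

def term_level_weights(docs):
    """Statistical weight: term frequency over the pseudo-relevant documents."""
    return normalize(Counter(t for d in docs for t in tokenize(d)))

def semantic_weights(query, units):
    """Each term accumulates the query similarity of every text unit that contains it."""
    q = embed(query)
    weights = Counter()
    for unit in units:
        sim = cosine(q, embed(unit))
        for term in set(tokenize(unit)):
            weights[term] += sim
    return normalize(weights)

def expand_query(query, docs, k=3, alpha=0.5, beta=0.3):
    """Return the top-k expansion terms; alpha and beta are hypothetical mixing weights."""
    sentences = [s for d in docs for s in re.split(r"(?<=[.!?])\s+", d) if s.strip()]
    term_w = term_level_weights(docs)
    sent_w = semantic_weights(query, sentences)  # sentence-level similarity
    pass_w = semantic_weights(query, docs)       # passage-level similarity
    gamma = 1.0 - alpha - beta
    query_terms = set(tokenize(query))
    scores = {
        t: alpha * term_w[t] + beta * sent_w.get(t, 0.0) + gamma * pass_w.get(t, 0.0)
        for t in term_w
        if t not in query_terms
    }
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

With two toy documents about aspirin and fever and the query "fever treatment", `expand_query` ranks "aspirin" first because it is frequent and also appears in the sentences and passages most similar to the query. The selected terms would then be appended to the original query for the second retrieval round.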


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e94d/11686017/3b781d9916f0/41598_2024_82871_Fig1_HTML.jpg
