Yoo Sooyoung, Choi Jinwook
Medical Information Center, Seoul National University Bundang Hospital, Seongnam, Korea.
Healthc Inform Res. 2011 Jun;17(2):120-30. doi: 10.4258/hir.2011.17.2.120. Epub 2011 Jun 30.
The purpose of this study was to investigate the effects of query expansion algorithms for MEDLINE retrieval within a pseudo-relevance feedback framework.
Several query expansion algorithms based on pseudo-relevance feedback were tested using various term ranking formulas. The OHSUMED test collection, a subset of the MEDLINE database, served as the test corpus. The ranking algorithms were evaluated in combination with different term re-weighting algorithms.
Our comprehensive evaluation showed that the local context analysis ranking algorithm, when combined with one of the re-weighting algorithms (Rocchio, the probabilistic model, or our variants), significantly outperformed other algorithm combinations by up to 12% (paired t-test; p < 0.05). In a pseudo-relevance feedback framework, effective query expansion can be achieved by carefully choosing the pairing of term ranking and re-weighting algorithms, at least in the context of the OHSUMED corpus.
Comparative experiments on term ranking algorithms were performed on a subset of MEDLINE documents. For medical documents, local context analysis, which uses co-occurrence with all query terms, significantly outperformed the term ranking methods based on frequency and distribution analyses. Furthermore, the experiments demonstrated that the term rank-based re-weighting method contributed to a marked improvement in mean average precision.
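The general shape of query expansion under pseudo-relevance feedback can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes whitespace tokenization, a plain term-frequency retrieval score, candidate terms ranked by mean frequency in the feedback documents (not local context analysis), and a positive-only Rocchio update. The function names and the parameters `top_k`, `n_terms`, `alpha`, and `beta` are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (an assumption of this sketch)."""
    return text.lower().split()


def score(query_weights, doc_tokens):
    """Dot product of query term weights and document term frequencies."""
    tf = Counter(doc_tokens)
    return sum(weight * tf[term] for term, weight in query_weights.items())


def expand_query(query, docs, top_k=2, n_terms=2, alpha=1.0, beta=0.5):
    """Return re-weighted query terms after one round of pseudo-relevance feedback."""
    query_weights = Counter(tokenize(query))
    doc_tokens = [tokenize(d) for d in docs]

    # First pass: retrieve with the original query and treat the
    # top_k documents as relevant without any human judgment.
    ranked = sorted(
        range(len(docs)),
        key=lambda i: score(query_weights, doc_tokens[i]),
        reverse=True,
    )
    feedback = ranked[:top_k]

    # Term ranking: mean term frequency over the feedback documents.
    centroid = Counter()
    for i in feedback:
        for term, count in Counter(doc_tokens[i]).items():
            centroid[term] += count / len(feedback)

    # Keep the n_terms best new terms; ties are broken alphabetically.
    candidates = sorted(
        (t for t in centroid if t not in query_weights),
        key=lambda t: (-centroid[t], t),
    )[:n_terms]

    # Re-weighting: Rocchio update without a non-relevant component.
    new_weights = {
        t: alpha * w + beta * centroid[t] for t, w in query_weights.items()
    }
    for term in candidates:
        new_weights[term] = beta * centroid[term]
    return new_weights
```

Swapping the term ranking step or the re-weighting step for another formula while holding the other fixed yields the kind of ranking and re-weighting pairing that the study compares.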