Xu Bo, Lin Hongfei, Lin Yuan
IEEE/ACM Trans Comput Biol Bioinform. 2018 Feb 2. doi: 10.1109/TCBB.2018.2801303.
The rapid growth of biomedicine has brought a corresponding increase in the number of biomedical articles, which makes it hard for biologists to keep up with the latest research. Information retrieval technologies address this challenge by searching a large collection of articles for a given query and returning the most relevant ones. Query expansion is an effective information retrieval technique, but when applied directly to biomedical information retrieval it falls short of the desired performance, because both users' queries and the related articles contain many domain-specific terms. To solve this problem, we propose a biomedical query expansion framework based on learning-to-rank methods, in which we train term-ranking models to refine the candidate expansion terms and select the most relevant ones for enriching the original query. To train the term-ranking models, we first propose a MeSH-based pseudo-relevance feedback method to select candidate expansion terms, and then represent each candidate term as a feature vector built from both corpus-based and resource-based term features. Experimental results on the TREC Genomics datasets show that our method captures more relevant terms for expanding the original query and effectively improves biomedical information retrieval performance.
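The pipeline described in the abstract (pseudo-relevance feedback to collect candidate terms, feature vectors per term, then a ranking step that picks the expansion terms) can be illustrated with a minimal Python sketch. This is not the authors' implementation. Every function name here is hypothetical, `resource_vocab` is a toy stand-in for a MeSH lookup, the three features are illustrative choices, and a hand-set linear scorer takes the place of a trained learning-to-rank model.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization is enough for this toy example.
    return text.lower().split()


def candidate_terms(feedback_docs, query, k=5):
    """Pseudo-relevance feedback: take the k most frequent non-query
    terms from the top-ranked (assumed relevant) documents."""
    query_terms = set(tokenize(query))
    counts = Counter(
        t for doc in feedback_docs for t in tokenize(doc)
        if t not in query_terms
    )
    return [t for t, _ in counts.most_common(k)]


def term_features(term, feedback_docs, corpus, resource_vocab):
    """Build a small feature vector for one candidate term.
    Corpus-based: frequency in the feedback docs and smoothed IDF.
    Resource-based: membership in a vocabulary (toy stand-in for MeSH)."""
    tf = sum(tokenize(d).count(term) for d in feedback_docs)
    df = sum(1 for d in corpus if term in tokenize(d))
    idf = math.log((len(corpus) + 1) / (df + 1))
    in_resource = 1.0 if term in resource_vocab else 0.0
    return [float(tf), idf, in_resource]


def rank_terms(terms, features, weights):
    """Score each term with a linear model and sort best-first.
    Assumption: `weights` would come from a trained term-ranking model;
    here they are set by hand. Ties are broken alphabetically."""
    scored = [
        (sum(w * x for w, x in zip(weights, f)), t)
        for t, f in zip(terms, features)
    ]
    return [t for _, t in sorted(scored, key=lambda p: (-p[0], p[1]))]


def expand_query(query, feedback_docs, corpus, resource_vocab,
                 weights, n_expand=2, k=5):
    """Append the n_expand best-ranked candidate terms to the query."""
    terms = candidate_terms(feedback_docs, query, k)
    feats = [term_features(t, feedback_docs, corpus, resource_vocab)
             for t in terms]
    ranked = rank_terms(terms, feats, weights)
    return tokenize(query) + ranked[:n_expand]
```

In a real system the feedback documents would come from a first-pass retrieval run, and the weights would be learned from terms labeled by how much they help retrieval when added to the query. The sketch only shows how the three stages fit together.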