Information Retrieval and Knowledge Management Research Lab, York University, Toronto, ON, M3J1P3, Canada.
BMC Bioinformatics. 2012 Jun 11;13 Suppl 9(Suppl 9):S2. doi: 10.1186/1471-2105-13-S9-S2.
The growth of biomedical information requires information retrieval systems to return short, specific answers to complex user queries. Semantic information expressed as free text is structured in a way that is straightforward for humans to read but difficult for computers to interpret automatically and search efficiently. One reason is that most traditional information retrieval models assume that terms are conditionally independent given a document or passage. We are therefore motivated to consider term associations within different contexts, so that the models can capture semantic information and use it to improve biomedical information retrieval performance.
We propose a term association approach that discovers associations among the keywords of a query. Experiments are conducted on the TREC 2004-2007 Genomics data sets and the TREC 2004 HARD data set. The proposed approach is promising and outperforms both the baselines and the GSP results. We investigate the parameter settings and different indices: the sentence-based index produces the best document-level results, the word-based index the best passage-level results, and the paragraph-based index the best passage2-level results. Furthermore, the best term association results always come from the best baseline. The tuning number k in the proposed recursive re-ranking algorithm is discussed and locally optimized to be 10.
First, modelling term association with factor analysis to improve biomedical information retrieval is one of the major contributions of our work. Second, the experiments confirm that term association, which considers co-occurrence and dependency among the keywords, produces better results than baselines that treat the keywords independently. Third, the baselines are re-ranked according to the importance and reliance of the latent factors behind term associations. These latent factors are determined by the proposed model and by the appearances of their terms in the passages retrieved in the first round.
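To make the re-ranking idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes binary keyword occurrence in the top-k first-round passages, approximates a single latent factor by the leading eigenvector of the keyword correlation matrix (power iteration standing in for a full factor analysis), and performs one re-ranking pass rather than the recursive procedure. The function names, the `weight` parameter, and the additive score combination are all hypothetical choices made for illustration.

```python
import math
import re


def occurrence_matrix(passages, keywords):
    """Binary passage-by-keyword matrix: 1.0 if the keyword occurs in the passage."""
    rows = []
    for text in passages:
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        rows.append([1.0 if kw in tokens else 0.0 for kw in keywords])
    return rows


def standardize(matrix):
    """Z-score each keyword column; constant columns become all zeros."""
    n, m = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(m)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in matrix) / n)
            for j in range(m)]
    return [[(row[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0
             for j in range(m)] for row in matrix]


def leading_factor(z, iters=200):
    """Loadings of one latent factor: the leading eigenvector of the keyword
    correlation matrix, approximated by power iteration (an assumption here,
    used in place of a full factor analysis)."""
    n, m = len(z), len(z[0])
    corr = [[sum(row[a] * row[b] for row in z) / n for b in range(m)]
            for a in range(m)]
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iters):
        w = [sum(corr[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            # Degenerate case: return zero loadings so the baseline order is kept.
            return [0.0] * m
        v = [x / norm for x in w]
    return v


def rerank(passages, baseline_scores, keywords, k=10, weight=0.5):
    """Return passage indices: the top-k baseline passages re-ordered by
    baseline score plus a weighted factor score, followed by the remaining
    passages in baseline order. `weight` is a hypothetical mixing parameter."""
    order = sorted(range(len(passages)), key=lambda i: -baseline_scores[i])
    top, rest = order[:k], order[k:]
    if not top or not keywords:
        return order
    z = standardize(occurrence_matrix([passages[i] for i in top], keywords))
    loadings = leading_factor(z)
    combined = {}
    for row, i in zip(z, top):
        factor_score = sum(l * x for l, x in zip(loadings, row))
        combined[i] = baseline_scores[i] + weight * factor_score
    return sorted(top, key=lambda i: -combined[i]) + rest


# Toy data (invented for illustration): passages 1 and 3 contain both keywords.
passages = ["no keywords here", "brca1 mutation found",
            "unrelated text", "brca1 mutation again"]
scores = [0.9, 0.8, 0.7, 0.6]
print(rerank(passages, scores, ["brca1", "mutation"], k=4))  # prints [1, 3, 0, 2]
```

In this toy run the two keywords always co-occur, so they load equally on the single factor, and the passages containing both move ahead of passages with higher baseline scores but no keyword association. With negatively correlated keywords the loadings can carry mixed signs, so a practical version would need to decide how to orient and weight each factor.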