Sehgal Aditya K, Srinivasan Padmini
Department of Computer Science, The University of Iowa, Iowa City, IA 52246, USA.
BMC Bioinformatics. 2006 Apr 21;7:220. doi: 10.1186/1471-2105-7-220.
The accuracy of document retrieval from MEDLINE for gene queries is crucial to many applications in bioinformatics. We explore five information retrieval-based methods for ranking the documents retrieved by PubMed gene queries for the human genome. The aim is to rank relevant documents higher in the retrieved list. We address the particular challenges posed by ambiguity in gene nomenclature: gene terms that refer to multiple genes, gene terms that are also English words, and gene terms that have other biological meanings.
Our two baseline ranking strategies perform similarly. Two of our three LocusLink-based strategies offer significant improvements, and they work well even when the gene terms are ambiguous. Our best ranking strategy significantly improves on both baseline strategies across three different kinds of ambiguity (improvements range from 15.9% to 17.7% and 11.7% to 13.3% depending on the baseline). For most genes, the best ranking query is one built from the LocusLink (now Entrez Gene) summary and product information along with the gene names and aliases. For others, the gene names and aliases suffice. We also present an approach that successfully predicts, for a given gene, which of these two ranking queries is more appropriate.
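The two kinds of ranking query described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain TF-IDF cosine-similarity ranker, which this abstract does not specify, and the gene record fields, the function names (`rank_documents`, `names_query`, `rich_query`) and the three toy documents are all hypothetical and invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_documents(query_text, documents):
    """Return document indices ordered by descending TF-IDF cosine similarity."""
    doc_tokens = [tokenize(doc) for doc in documents]
    n_docs = len(documents)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))

    def vectorize(tokens):
        # Smoothed IDF so that query terms absent from every document are still defined.
        tf = Counter(tokens)
        return {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t, c in tf.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        if norm_a == 0.0 or norm_b == 0.0:
            return 0.0
        return dot / (norm_a * norm_b)

    query_vec = vectorize(tokenize(query_text))
    scores = [cosine(query_vec, vectorize(tokens)) for tokens in doc_tokens]
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


def names_query(gene):
    """Ranking query built from the gene names and aliases only."""
    return " ".join(gene["names"] + gene["aliases"])


def rich_query(gene):
    """Ranking query that adds the summary and product text to the names and aliases."""
    return " ".join([names_query(gene), gene["summary"], gene["product"]])


# Hypothetical gene record; the symbol is also an English word.
# The field values are written for illustration, not copied from Entrez Gene.
gene = {
    "names": ["CAT"],
    "aliases": ["catalase"],
    "summary": "This gene encodes catalase, an antioxidant enzyme that "
               "converts hydrogen peroxide to water and oxygen",
    "product": "catalase",
}

# Toy stand-ins for the documents a PubMed gene query might retrieve.
documents = [
    "The cat is a popular household pet and cat owners report lower stress.",
    "Catalase decomposes hydrogen peroxide into water and oxygen and protects "
    "cells from oxidative damage.",
    "CAT gene polymorphisms alter catalase enzyme activity in human erythrocytes.",
]

print("names and aliases only:", rank_documents(names_query(gene), documents))
print("with summary and product:", rank_documents(rich_query(gene), documents))
```

On this toy data the names-only query ranks the document about pet cats above one of the catalase documents, while the richer query pushes it to the bottom. Real gene records and retrieved sets may behave differently, which is consistent with the observation above that the names and aliases alone suffice for some genes.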
We explore the effect of different post-retrieval strategies on the ranking of documents returned by PubMed for human gene queries. We have successfully applied some of these strategies to improve the ranking of relevant documents in the retrieved sets, and the improvement holds even when various kinds of ambiguity are encountered. Because PubMed search results are not ordered by relevance in any way, we believe that applying strategies like ours to them would be very useful, especially for queries that retrieve a large number of documents.