BMC Bioinformatics. 2014;15 Suppl 12(Suppl 12):S1. doi: 10.1186/1471-2105-15-S12-S1. Epub 2014 Nov 6.
Currently, most people use NCBI's PubMed to search the MEDLINE database, an important bibliographic source for life science and biomedical information. However, PubMed has drawbacks that make it difficult to find publications relevant to a user's individual intention, especially for non-expert users. To address these drawbacks, we developed G-Bean, a graph-based biomedical search engine, to search biomedical articles in the MEDLINE database more efficiently.
G-Bean addresses PubMed's limitations with three innovations: (1) Parallel document index creation: a multithreaded strategy generates the document index for G-Bean in parallel; (2) Ontology-graph based query expansion: an ontology graph is constructed by merging four major UMLS (Version 2013AA) vocabularies (MeSH, SNOMEDCT, CSP and AOD) to cover all concepts in the National Library of Medicine (NLM) database; a Personalized PageRank algorithm computes concept relevance in this ontology graph, and the Term Frequency - Inverse Document Frequency (TF-IDF) weighting scheme is used to re-rank the concepts. The top 500 ranked concepts are selected to expand the initial query and retrieve more accurate and relevant information; (3) Retrieval and re-ranking of documents based on the user's search intention: after the user selects any article from the existing search results, G-Bean analyzes the selections to determine the user's true search intention and then uses more relevant and more specific terms to retrieve additional related articles. The new articles are presented to the user in order of their relevance to the already selected articles.
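The query-expansion step in (2) can be illustrated with a minimal sketch. It is not G-Bean's implementation: the toy graph, the toy documents, the function names, the damping factor of 0.85, and the choice to combine scores by multiplying the Personalized PageRank value with a smoothed IDF weight are all assumptions made for illustration, since the abstract does not specify how the two rankings are combined.

```python
from math import log

def personalized_pagerank(graph, seeds, damping=0.85, iterations=100):
    """Power iteration for Personalized PageRank.

    graph: dict mapping each concept to a list of neighbouring concepts.
    seeds: concepts matched by the initial query; the random walk restarts here.
    """
    seeds = list(seeds)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
    rank = dict(restart)
    for _ in range(iterations):
        # Every node receives its share of the restart probability first.
        nxt = {n: (1.0 - damping) * restart[n] for n in graph}
        for n, neighbours in graph.items():
            if neighbours:
                share = damping * rank[n] / len(neighbours)
                for m in neighbours:
                    nxt[m] += share
            else:
                # Dangling concept: hand its mass back to the seed concepts.
                for s in seeds:
                    nxt[s] += damping * rank[n] / len(seeds)
        rank = nxt
    return rank

def idf_weights(concepts, documents):
    """Smoothed inverse document frequency of each concept (always positive)."""
    n = len(documents)
    return {c: log((1 + n) / (1 + sum(c in d for d in documents))) + 1.0
            for c in concepts}

def expand_query(graph, seeds, documents, k):
    """Return the k highest-scoring concepts (assumed score: PPR * IDF)."""
    rank = personalized_pagerank(graph, seeds)
    idf = idf_weights(graph, documents)
    scored = {c: rank[c] * idf[c] for c in graph}
    return sorted(scored, key=scored.get, reverse=True)[:k]

# Hypothetical toy ontology: two disconnected groups of concepts.
TOY_GRAPH = {
    "asthma": ["lung disease", "bronchodilator"],
    "lung disease": ["asthma", "copd"],
    "bronchodilator": ["asthma"],
    "copd": ["lung disease"],
    "fracture": ["bone"],
    "bone": ["fracture"],
}
# Hypothetical documents, each reduced to the set of concepts it mentions.
TOY_DOCS = [
    {"asthma", "bronchodilator"},
    {"asthma", "lung disease"},
    {"fracture", "bone"},
]

if __name__ == "__main__":
    print(expand_query(TOY_GRAPH, {"asthma"}, TOY_DOCS, k=3))
```

Because the walk restarts only at the seed concept, concepts that are unreachable from it (here "fracture" and "bone") keep a score of zero and never enter the expanded query. G-Bean keeps the top 500 concepts; `k=3` is used here only because the toy graph is tiny.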
Performance evaluation with 106 OHSUMED benchmark queries shows that G-Bean returns more relevant results than PubMed when these queries are used to search the MEDLINE database. PubMed returned no results at all for some OHSUMED queries because it failed to automatically form an appropriate Boolean query from the natural-language query strings. G-Bean is available at http://bioinformatics.clemson.edu/G-Bean/index.php.
G-Bean addresses PubMed's limitations with ontology-graph based query expansion, automatic document indexing, and user search intention discovery. It shows significant advantages in finding relevant articles in the MEDLINE database that meet the user's information needs.