School of Information Technology, Middle Georgia State College, Macon, GA 31206, United States.
J Biomed Inform. 2013 Oct;46(5):929-39. doi: 10.1016/j.jbi.2013.07.006. Epub 2013 Jul 25.
Although biomedical information available in articles and patents is increasing exponentially, we continue to rely on the same information retrieval methods and use very few keywords to search millions of documents. We are developing a fundamentally different approach for finding much more precise and complete information with a single query, using predicates instead of keywords for both query and document representation. Predicates are triples: data structures that are more complex than keywords and carry more structured information. To make optimal use of them, we developed a new predicate-based vector space model and a query-document similarity function with adjusted tf-idf and a boost function. Using a test bed of 107,367 PubMed abstracts, we evaluated the first essential function: retrieving information. Cancer researchers provided 20 realistic queries, for which the top 15 abstracts were retrieved using a predicate-based (new) and a keyword-based (baseline) approach. Each abstract was evaluated, double-blind, by cancer researchers on a 0-5 point scale to calculate precision (0 versus higher) and relevance (0-5 score). Precision was significantly higher (p<.001) for the predicate-based (80%) than for the keyword-based (71%) approach. Relevance was almost doubled with the predicate-based approach: 2.1 versus 1.6 without rank order adjustment (p<.001) and 1.34 versus 0.98 with rank order adjustment (p<.001) for the predicate- versus keyword-based approach, respectively. Predicates can support more precise searching than keywords, laying the foundation for rich and sophisticated information search.
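To make the idea of a predicate-based vector space model concrete, the following is a minimal sketch, not the authors' implementation. It assumes a predicate can be modelled as a (subject, relation, object) tuple, uses a plain smoothed tf-idf weight in place of the paper's adjusted tf-idf, and omits the boost function, since the abstract does not specify either. The toy documents, the function names, and the example scores are all hypothetical. The two small helpers at the end follow the abstract's definitions of precision (a score of 0 versus higher) and relevance (mean 0-5 score); the rank order adjustment is left out because its formula is not given.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of predicates,
# modelled here as (subject, relation, object) triples.
docs = {
    "d1": [("tamoxifen", "treats", "breast cancer"),
           ("tamoxifen", "inhibits", "estrogen receptor")],
    "d2": [("tamoxifen", "treats", "breast cancer"),
           ("BRCA1", "associated_with", "breast cancer"),
           ("BRCA1", "repairs", "DNA")],
    "d3": [("aspirin", "treats", "headache")],
}

def idf(term, corpus):
    """Smoothed inverse document frequency (an assumed, standard variant)."""
    df = sum(1 for terms in corpus.values() if term in terms)
    return math.log((1 + len(corpus)) / (1 + df)) + 1.0

def vectorize(terms, corpus):
    """Map each predicate to its tf-idf weight; whole triples are the vector dimensions."""
    tf = Counter(terms)
    return {t: count * idf(t, corpus) for t, count in tf.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, corpus):
    """Return (doc_id, similarity) pairs sorted from most to least similar."""
    q = vectorize(query, corpus)
    scores = {d: cosine(q, vectorize(terms, corpus)) for d, terms in corpus.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def precision(ratings):
    """Fraction of retrieved abstracts rated above 0 on the 0-5 scale."""
    return sum(1 for r in ratings if r > 0) / len(ratings)

def mean_relevance(ratings):
    """Mean 0-5 rating, without any rank order adjustment."""
    return sum(ratings) / len(ratings)

# A query expressed as a single predicate rather than as keywords.
ranking = rank([("tamoxifen", "treats", "breast cancer")], docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Because each whole triple is a dimension, a document that mentions "aspirin treats headache" shares no dimension with the query even though both contain the relation "treats"; a keyword model would give such a document partial credit. That is one plausible reading of why predicates can yield more precise results, though the paper's actual similarity function may differ.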