National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA.
J Biomed Inform. 2022 Oct;134:104211. doi: 10.1016/j.jbi.2022.104211. Epub 2022 Sep 21.
A significant number of recent articles in PubMed have full text available in PubMed Central®, and the availability of full text has grown steadily. However, a user cannot currently query the contents of both databases at once and receive a single integrated search result. In this study, we investigate how to score full-text articles given a multi-token query, and how to combine those full-text scores with scores originating from abstracts to improve overall retrieval performance.
For scoring full-text articles, we propose a method that combines information from different sections by converting the traditionally used BM25 scores into log odds ratio scores, which can be treated uniformly. We further propose a method that successfully combines scores from two heterogeneous retrieval sources (full-text articles and abstract-only articles) by balancing the contributions of their respective scores through a probabilistic transformation. We use PubMed click data, consisting of queries sampled from PubMed user logs along with a subset of retrieved and clicked documents, to train the probabilistic functions and to evaluate retrieval effectiveness.
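The conversion from BM25 scores to log odds ratio scores can be illustrated with a minimal sketch. The abstract does not specify the estimator, so everything here is an assumption made for illustration: the function names `fit_log_odds` and `to_log_odds`, the use of fixed score bins, and the additive smoothing constant are all hypothetical. The idea shown is only that, for one section type, each BM25 score range is mapped to the log of the ratio between the click odds within that range and the overall click odds, which puts sections on a common scale.

```python
import math
from bisect import bisect_right


def fit_log_odds(scores, clicked, edges, smoothing=0.5):
    """Estimate one log odds ratio per BM25 score bin for a single section type.

    scores   -- BM25 scores of (query, document) pairs for this section
    clicked  -- parallel sequence of booleans: was the document clicked?
    edges    -- sorted bin boundaries (hypothetical choice; not from the paper)
    smoothing-- additive smoothing so that empty bins stay finite (assumption)
    """
    n_bins = len(edges) + 1
    pos = [0] * n_bins  # clicked documents per bin
    neg = [0] * n_bins  # non-clicked documents per bin
    for score, was_clicked in zip(scores, clicked):
        b = bisect_right(edges, score)
        if was_clicked:
            pos[b] += 1
        else:
            neg[b] += 1

    total_pos, total_neg = sum(pos), sum(neg)
    table = []
    for b in range(n_bins):
        # P(bin | clicked) / P(bin | not clicked) equals
        # odds(clicked | bin) / odds(clicked), i.e. an odds ratio.
        p_clicked = (pos[b] + smoothing) / (total_pos + smoothing * n_bins)
        p_not_clicked = (neg[b] + smoothing) / (total_neg + smoothing * n_bins)
        table.append(math.log(p_clicked / p_not_clicked))
    return table


def to_log_odds(score, edges, table):
    """Look up the log odds ratio for a new BM25 score."""
    return table[bisect_right(edges, score)]
```

One such table would be fitted per section type (and, under the same assumption, one for abstracts), so that a score of a given size in a title and in a methods section can map to different log odds values.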
Random ranking achieves a MAP score of 0.579 on our PubMed click data. BM25 ranking on PubMed abstracts improves the MAP by 10.6%. For full-text documents, experiments confirm that BM25 section scores differ in value depending on the section type and are not directly comparable. Naïvely using the body text of articles along with abstract text degrades the overall quality of the search. The proposed log odds ratio scores normalize and combine the contributions of query token occurrences in different sections. By including full text where available, we gain another 0.67%, or 7% relative improvement over abstract alone. The advantage comes from a more accurate estimate of the value of BM25 scores depending on the section from which they were produced. Summing the top three section scores performs best.
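The final combination rule, summing the top three section scores, can be sketched as follows. This is an illustrative sketch only: the function name `combine_sections`, the section labels, and the numeric values are hypothetical, and it assumes each section score has already been converted to a log odds ratio so that the values are comparable. It does not cover the probabilistic transformation used to balance full-text and abstract-only documents.

```python
import heapq


def combine_sections(section_log_odds, k=3):
    """Sum the k largest per-section log odds scores of one document.

    section_log_odds -- mapping from section name to its log odds score
    k                -- number of top sections to sum (3 performed best
                        according to the abstract)
    """
    return sum(heapq.nlargest(k, section_log_odds.values()))


# Hypothetical scores for one full-text document and one abstract-only record.
full_text_doc = {
    "title": 1.2,
    "abstract": 0.8,
    "methods": -0.3,
    "results": 0.5,
    "caption": 0.1,
}
abstract_only_doc = {"title": 0.4, "abstract": 0.9}

ranked = sorted(
    [("full_text_doc", combine_sections(full_text_doc)),
     ("abstract_only_doc", combine_sections(abstract_only_doc))],
    key=lambda pair: pair[1],
    reverse=True,
)
```

A document with fewer than three scored sections simply contributes the sum of the sections it has, which is one possible way to keep abstract-only records in the same ranking.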