Towards a unified search: Improving PubMed retrieval with full text.

Affiliation

National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA.

Publication information

J Biomed Inform. 2022 Oct;134:104211. doi: 10.1016/j.jbi.2022.104211. Epub 2022 Sep 21.

DOI:10.1016/j.jbi.2022.104211

PMID:36152950

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC9561061/

Abstract

OBJECTIVE

A significant number of recent articles in PubMed have full text available in PubMed Central®, and full-text availability has been growing steadily. However, a user cannot currently query the contents of both databases simultaneously and receive a single integrated search result. In this study, we investigate how to score full-text articles given a multi-token query, and how to combine those full-text article scores with scores originating from abstracts to achieve an overall improvement in retrieval performance.

MATERIALS AND METHODS

For scoring full-text articles, we propose a method that combines information from different sections by converting the traditionally used BM25 scores into log odds ratio scores, which can be treated uniformly. We further propose a method that successfully combines scores from two heterogeneous retrieval sources (full-text articles and abstract-only articles) by balancing the contributions of their respective scores through a probabilistic transformation. We use PubMed click data, consisting of queries sampled from PubMed user logs along with a subset of retrieved and clicked documents, to train the probabilistic functions and to evaluate retrieval effectiveness.
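As an illustration of the score conversion described above, the following Python sketch maps raw BM25 scores from one section type to log odds ratio scores estimated from click data. The abstract does not specify the transformation, so the score binning, the add-alpha smoothing, and the names fit_log_odds_ratio, to_log_odds_ratio, edges, and alpha are all assumptions made for this example, not the authors' implementation.

```python
import math
from bisect import bisect_right


def fit_log_odds_ratio(scores, clicked, edges, alpha=1.0):
    """Estimate one log odds ratio per BM25 score bin (hypothetical scheme).

    scores  : BM25 scores of one section type for (query, document) pairs
    clicked : 1 if the document was clicked for that query, else 0
    edges   : sorted bin boundaries; bin b holds edges[b-1] <= score < edges[b]
    alpha   : add-alpha smoothing so that empty bins stay finite
    """
    n_bins = len(edges) + 1
    pos = [0] * n_bins
    neg = [0] * n_bins
    for score, click in zip(scores, clicked):
        b = bisect_right(edges, score)
        if click:
            pos[b] += 1
        else:
            neg[b] += 1
    # Overall (prior) odds of a click across all bins.
    prior_odds = (sum(pos) + alpha) / (sum(neg) + alpha)
    # Log of (odds of a click within the bin) / (overall odds of a click).
    return [math.log(((p + alpha) / (n + alpha)) / prior_odds)
            for p, n in zip(pos, neg)]


def to_log_odds_ratio(score, edges, table):
    """Look up the fitted log odds ratio for a raw BM25 score."""
    return table[bisect_right(edges, score)]
```

Fitting one table per section type would put scores from, say, titles and methods sections on a common scale, so that they can be added or compared directly.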

RESULTS AND CONCLUSIONS

Random ranking achieves a MAP score of 0.579 on our PubMed click data. BM25 ranking on PubMed abstracts improves the MAP by 10.6%. For full-text documents, experiments confirm that BM25 section scores differ in value depending on the section type and are not directly comparable. Naïvely using the body text of articles along with abstract text degrades the overall quality of the search. The proposed log odds ratio scores normalize and combine the contributions of query token occurrences in different sections. By including full text where available, we gain another 0.67%, or a 7% relative improvement over abstracts alone. We find an advantage in the more accurate estimate of the value of BM25 scores depending on the section from which they were produced. Taking the sum of the top three section scores performs best.
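The final aggregation step can be sketched in a few lines of Python. This is a minimal example, assuming each section of an article already has a comparable score (for instance a log odds ratio score); the function name article_score, the parameter k, and the section names and values are hypothetical and do not come from the paper.

```python
def article_score(section_scores, k=3):
    """Sum the k largest section-level scores of one article."""
    return sum(sorted(section_scores.values(), reverse=True)[:k])


# Hypothetical per-section scores for a single full-text article.
example_sections = {
    "title": 1.2,
    "abstract": 0.8,
    "methods": -0.3,
    "results": 0.5,
    "discussion": 0.1,
}
example_total = article_score(example_sections)  # sums 1.2, 0.8 and 0.5
```

Limiting the sum to the top k sections means a long article with many weakly matching sections does not outrank one with a few strongly matching sections. An article with fewer than k sections simply sums all of its scores.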

Similar articles

Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.

Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.

In the pursuit of a semantic similarity metric based on UMLS annotations for articles in PubMed Central Open Access.

J Biomed Inform. 2015 Oct;57:204-18. doi: 10.1016/j.jbi.2015.07.015. Epub 2015 Aug 1.

Is searching full text more effective than searching abstracts?

BMC Bioinformatics. 2009 Feb 3;10:46. doi: 10.1186/1471-2105-10-46.

An improved BM25 algorithm for clinical decision support in Precision Medicine based on co-word analysis and Cuckoo Search.

BMC Med Inform Decis Mak. 2021 Mar 2;21(1):81. doi: 10.1186/s12911-021-01454-5.

Using cited references to improve the retrieval of related biomedical documents.

BMC Bioinformatics. 2013 Mar 27;14:113. doi: 10.1186/1471-2105-14-113.

Learning to rank query expansion terms for COVID-19 scholarly search.

J Biomed Inform. 2023 Jun;142:104386. doi: 10.1016/j.jbi.2023.104386. Epub 2023 May 12.

How Does ChatGPT Use Source Information Compared With Google? A Text Network Analysis of Online Health Information.

Clin Orthop Relat Res. 2024 Apr 1;482(4):578-588. doi: 10.1097/CORR.0000000000002995. Epub 2024 Mar 1.

Bridging the gap: Incorporating a semantic similarity measure for effectively mapping PubMed queries to documents.

J Biomed Inform. 2017 Nov;75:122-127. doi: 10.1016/j.jbi.2017.09.014. Epub 2017 Oct 3.

G-Bean: an ontology-graph based web tool for biomedical literature retrieval.

BMC Bioinformatics. 2014;15 Suppl 12(Suppl 12):S1. doi: 10.1186/1471-2105-15-S12-S1. Epub 2014 Nov 6.

Cited by

ScRAPdb: an integrated pan-omics database for the Saccharomyces cerevisiae reference assembly panel.

Nucleic Acids Res. 2025 Jan 6;53(D1):D852-D863. doi: 10.1093/nar/gkae955.

Clinical Impact of "Real World Data" and Blockchain on Public Health: A Scoping Review.

Int J Environ Res Public Health. 2024 Jan 15;21(1):95. doi: 10.3390/ijerph21010095.

APPRAISE-RS: Automated, updated, participatory, and personalized treatment recommender systems based on GRADE methodology.

Heliyon. 2023 Jan 24;9(2):e13074. doi: 10.1016/j.heliyon.2023.e13074. eCollection 2023 Feb.
