MedCPT: Contrastive Pre-trained Transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval.

Affiliation

National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, United States.

Publication information

Bioinformatics. 2023 Nov 1;39(11). doi: 10.1093/bioinformatics/btad651.

DOI:10.1093/bioinformatics/btad651

PMID:37930897

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10627406/

Abstract

MOTIVATION

Information retrieval (IR) is essential in biomedical knowledge acquisition and clinical decision support. While recent progress has shown that language model encoders perform better semantic retrieval, training such models requires abundant query-article annotations that are difficult to obtain in biomedicine. As a result, most biomedical IR systems only conduct lexical matching. In response, we introduce MedCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot semantic IR in biomedicine.

RESULTS

To train MedCPT, we collected an unprecedented scale of 255 million user click logs from PubMed. With such data, we use contrastive learning to train a pair of closely integrated retriever and re-ranker. Experimental results show that MedCPT sets new state-of-the-art performance on six biomedical IR tasks, outperforming various baselines including much larger models, such as GPT-3-sized cpt-text-XL. In addition, MedCPT also generates better biomedical article and sentence representations for semantic evaluations. As such, MedCPT can be readily applied to various real-world biomedical IR tasks.
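The abstract does not spell out the training objective, so the following is only a minimal sketch of one common form of contrastive learning for query-article pairs: each clicked article is treated as the positive for its query, and the other articles in the batch serve as negatives (an InfoNCE-style loss). It also sketches a generic retrieve-then-rerank pipeline. The function names, the dot-product scoring, the `temperature` parameter, and the `rerank_score` callback are all illustrative assumptions and are not taken from the MedCPT code.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(query_vecs, article_vecs, temperature=1.0):
    """Mean InfoNCE-style loss over a batch of (query, clicked article) pairs.

    Assumption: article_vecs[i] is the positive for query_vecs[i], and every
    other article in the batch acts as a negative for that query.
    """
    if len(query_vecs) != len(article_vecs):
        raise ValueError("each query needs exactly one paired article")
    total = 0.0
    for i, q in enumerate(query_vecs):
        scores = [dot(q, d) / temperature for d in article_vecs]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability of the positive article.
        total += log_z - scores[i]
    return total / len(query_vecs)


def retrieve_then_rerank(query_vec, article_vecs, rerank_score, top_k=2):
    """Two-stage ranking: dense retrieval, then re-ranking of the candidates.

    rerank_score is a hypothetical callback mapping an article index to a
    relevance score; it stands in for a more expensive re-ranker model.
    """
    by_similarity = sorted(
        range(len(article_vecs)),
        key=lambda i: dot(query_vec, article_vecs[i]),
        reverse=True,
    )
    candidates = by_similarity[:top_k]
    return sorted(candidates, key=rerank_score, reverse=True)


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    articles = [[1.0, 0.0], [0.0, 1.0]]
    print(in_batch_contrastive_loss(queries, articles))
    print(in_batch_contrastive_loss(queries, articles[::-1]))
```

In this toy setup, the loss is lower when each query is paired with its matching article than when the pairing is swapped, which is the signal a contrastively trained retriever learns from. The second stage re-orders only the retriever's top candidates, which keeps the more expensive scorer off the full collection.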

AVAILABILITY AND IMPLEMENTATION

The MedCPT code and model are available at https://github.com/ncbi/MedCPT.
