MedCPT: Contrastive Pre-trained Transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval.
Affiliation
National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, United States.
Publication information
Bioinformatics. 2023 Nov 1;39(11). doi: 10.1093/bioinformatics/btad651.
MOTIVATION
Information retrieval (IR) is essential in biomedical knowledge acquisition and clinical decision support. While recent progress has shown that language model encoders perform better semantic retrieval, training such models requires abundant query-article annotations that are difficult to obtain in biomedicine. As a result, most biomedical IR systems only conduct lexical matching. In response, we introduce MedCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot semantic IR in biomedicine.
RESULTS
To train MedCPT, we collected 255 million user click logs from PubMed, an unprecedented scale. With these data, we use contrastive learning to train a closely integrated retriever and re-ranker pair. Experimental results show that MedCPT sets new state-of-the-art performance on six biomedical IR tasks, outperforming various baselines, including much larger models such as the GPT-3-sized cpt-text-XL. MedCPT also generates better biomedical article and sentence representations for semantic evaluations. As such, MedCPT can be readily applied to various real-world biomedical IR tasks.
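As a rough illustration of the contrastive objective mentioned above, the following sketch computes an InfoNCE-style loss over (query, clicked article) embedding pairs. It assumes in-batch negatives, dot-product similarity, and a temperature parameter; the abstract specifies none of these, and the function name `in_batch_contrastive_loss` and the toy vectors are hypothetical. It is not the MedCPT training code, which is available in the repository listed under Availability and Implementation.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(query_vecs, article_vecs, temperature=1.0):
    """Mean InfoNCE loss over a batch (illustrative assumption, not MedCPT's code).

    article_vecs[i] is treated as the clicked (positive) article for
    query_vecs[i]; every other article in the batch acts as a negative.
    """
    if not query_vecs or len(query_vecs) != len(article_vecs):
        raise ValueError("need equally sized, non-empty batches")
    total = 0.0
    for i, q in enumerate(query_vecs):
        # Similarity of this query to every article in the batch.
        logits = [dot(q, d) / temperature for d in article_vecs]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the clicked article.
        total += log_norm - logits[i]
    return total / len(query_vecs)


# Toy embeddings (made up): each query aligns with its own article.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[2.0, 0.0], [0.0, 2.0]]
shuffled = [[0.0, 2.0], [2.0, 0.0]]

loss_aligned = in_batch_contrastive_loss(queries, aligned)
loss_shuffled = in_batch_contrastive_loss(queries, shuffled)
print(loss_aligned < loss_shuffled)
```

The loss is lower when each query embedding is closest to its own clicked article, which is the behavior a contrastively trained retriever is optimized toward.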
AVAILABILITY AND IMPLEMENTATION
The MedCPT code and model are available at https://github.com/ncbi/MedCPT.