
Publication Type Tagging using Transformer Models and Multi-Label Classification.

Author Information

Menke Joe D, Kilicoglu Halil, Smalheiser Neil R

Affiliations

School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL.

Department of Psychiatry, University of Illinois Chicago, Chicago, IL.

Publication Information

AMIA Annu Symp Proc. 2025 May 22;2024:818-827. eCollection 2024.

Abstract

Indexing articles by their publication type and study design is essential for efficient search and filtering of the biomedical literature, but is understudied compared to indexing by MeSH topical terms. In this study, we leveraged the human-curated publication types and study designs in PubMed to generate a dataset of more than 1.2M articles (titles and abstracts) and used state-of-the-art Transformer-based models for automatic tagging of publication types and study designs. Specifically, we trained PubMedBERT-based models using a multi-label classification approach, and explored undersampling, feature verbalization, and contrastive learning to improve model performance. Our results show that PubMedBERT provides a strong baseline for publication type and study design indexing; undersampling, feature verbalization, and unsupervised contrastive loss have a positive impact on performance, whereas supervised contrastive learning degrades the performance. We obtained the best overall performance with 80% undersampling and feature verbalization (0.632 macro-F1, 0.969 macro-AUC). The model outperformed previous models (MultiTagger) across all metrics and the performance difference was statistically significant (P < 0.001). Despite its stronger performance, the model still has room for improvement, and future work could explore features based on full text as well as model interpretability. We make our data and code available at https://github.com/ScienceNLP-Lab/MultiTagger-v2/tree/main/AMIA.
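
The authors' actual training pipeline is in the linked repository; the snippet below is only a minimal, self-contained sketch of the kind of multi-label classification setup the abstract describes, in which a PubMedBERT-style encoder produces one independent sigmoid score per publication-type label. The checkpoint name, label subset, example text, and 0.5 decision threshold are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: multi-label publication-type tagging with a PubMedBERT-style encoder
# via the Hugging Face transformers API. Labels, text, and threshold are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"  # assumed checkpoint
LABELS = ["Randomized Controlled Trial", "Meta-Analysis", "Case Report"]       # illustrative subset

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # BCE-with-logits loss, one sigmoid per label
)

# Title and abstract are concatenated into a single input sequence (truncated at 512 tokens).
text = "Effect of drug X on outcome Y: a randomized controlled trial. [abstract text]"
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Each label gets an independent probability; a fixed 0.5 cutoff is one simple thresholding choice.
probs = torch.sigmoid(logits).squeeze(0)
predicted = [label for label, p in zip(LABELS, probs) if p >= 0.5]
print(predicted)
```

In practice the classification head would be fine-tuned on the labeled PubMed data before inference; the undersampling, feature verbalization, and contrastive objectives reported in the abstract are refinements on top of this basic multi-label setup.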


Similar Articles

4. Recommending MeSH terms for annotating biomedical articles. J Am Med Inform Assoc. 2011 Sep-Oct;18(5):660-7. doi: 10.1136/amiajnl-2010-000055. Epub 2011 May 25.
5. A recent advance in the automatic indexing of the biomedical literature. J Biomed Inform. 2009 Oct;42(5):814-23. doi: 10.1016/j.jbi.2008.12.007. Epub 2008 Dec 30.

