Publication Type Tagging using Transformer Models and Multi-Label Classification.

Authors

Menke Joe D, Kilicoglu Halil, Smalheiser Neil R

Affiliations

School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL.

Department of Psychiatry, University of Illinois Chicago, Chicago, IL.

Publication information

AMIA Annu Symp Proc. 2025 May 22;2024:818-827. eCollection 2024.

PMID: 40417522
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12099436/

Abstract

Indexing articles by their publication type and study design is essential for efficient search and filtering of the biomedical literature, but is understudied compared to indexing by MeSH topical terms. In this study, we leveraged the human-curated publication types and study designs in PubMed to generate a dataset of more than 1.2M articles (titles and abstracts) and used state-of-the-art Transformer-based models for automatic tagging of publication types and study designs. Specifically, we trained PubMedBERT-based models using a multi-label classification approach, and explored undersampling, feature verbalization, and contrastive learning to improve model performance. Our results show that PubMedBERT provides a strong baseline for publication type and study design indexing; undersampling, feature verbalization, and unsupervised contrastive loss have a positive impact on performance, whereas supervised contrastive learning degrades performance. We obtained the best overall performance with 80% undersampling and feature verbalization (0.632 macro-F1, 0.969 macro-AUC). The model outperformed the previous model (MultiTagger) across all metrics, and the performance difference was statistically significant (p < 0.001). Despite its stronger performance, the model still has room for improvement, and future work could explore features based on full text as well as model interpretability. We make our data and code available at https://github.com/ScienceNLP-Lab/MultiTagger-v2/tree/main/AMIA.
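
To make the reported metric concrete, the following is a minimal sketch of how multi-label predictions are commonly thresholded and scored with macro-F1. It is not the authors' code (that is in the repository linked in the abstract). The label names, the example scores, and the 0.5 decision threshold are all illustrative assumptions, not values taken from the paper.

```python
from typing import List, Sequence

# Hypothetical label set, for illustration only; the paper's label inventory is larger.
LABELS = ["Randomized Controlled Trial", "Review", "Case Reports"]


def binarize(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Turn per-label probabilities into independent 0/1 decisions.

    In multi-label classification each label is decided on its own,
    so an article can receive zero, one, or several tags.
    The 0.5 threshold is an assumption of this sketch.
    """
    return [1 if s >= threshold else 0 for s in scores]


def f1_for_label(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for a single label, computed from its column of gold and predicted values."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold: Sequence[Sequence[int]], pred: Sequence[Sequence[int]]) -> float:
    """Unweighted mean of per-label F1, so rare labels count as much as common ones."""
    n_labels = len(gold[0])
    per_label = [
        f1_for_label([row[j] for row in gold], [row[j] for row in pred])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


# Toy data: four articles, three labels (rows are articles, columns follow LABELS).
gold = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
scores = [
    [0.9, 0.2, 0.1],
    [0.4, 0.8, 0.3],
    [0.7, 0.3, 0.1],
    [0.1, 0.6, 0.7],
]
pred = [binarize(row) for row in scores]

print(f"macro-F1 = {macro_f1(gold, pred):.3f}")  # macro-F1 = 0.833
```

Because macro-averaging weights every label equally, a model that does well on frequent publication types but poorly on rare ones is penalized, which is one reason a macro-F1 can sit well below the corresponding macro-AUC.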
