
When BERT meets Bilbo: a learning curve analysis of pretrained language model on disease classification.

Affiliations

College of Computer Science, Sichuan University, Chengdu, China.

MobLab Inc., Pasadena, CA, USA.

Publication Information

BMC Med Inform Decis Mak. 2022 Apr 5;21(Suppl 9):377. doi: 10.1186/s12911-022-01829-2.

DOI: 10.1186/s12911-022-01829-2
PMID: 35382811
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8981604/
Abstract

BACKGROUND

Natural language processing (NLP) tasks in the health domain often deal with a limited amount of labeled data due to high annotation costs and naturally rare observations. To compensate for the lack of training data, health NLP researchers often have to leverage knowledge and resources external to the task at hand. Recently, pretrained large-scale language models such as Bidirectional Encoder Representations from Transformers (BERT) have proven to be a powerful way of learning rich linguistic knowledge from massive unlabeled text and transferring that knowledge to downstream tasks. However, previous downstream tasks have often used training data at a scale that is unlikely to be available in the health domain. In this work, we aim to study whether BERT can still benefit downstream tasks when training data are relatively small in the context of health NLP.
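
In practice, the transfer step described here is a fine-tune: load the pretrained encoder, attach a classification head, and train briefly on the small labeled set. A minimal sketch, assuming the HuggingFace transformers API; the checkpoint name and the toy texts and labels below are placeholders, not the paper's actual setup:

```python
# Minimal BERT fine-tuning sketch (assumed HuggingFace transformers API);
# the checkpoint and the toy data are placeholders, not the paper's setup.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-uncased"                              # placeholder checkpoint
texts = ["fever and cough for 3 days", "chronic joint pain"]  # toy clinical notes
labels = torch.tensor([0, 1])                                 # toy disease labels

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                      # a few passes over the tiny labeled set
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()             # cross-entropy from the classification head
    optimizer.step()
    optimizer.zero_grad()
```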

METHOD

We conducted a learning curve analysis to study the behavior of BERT and baseline models as training data size increases. We observed the classification performance of these models on two disease diagnosis data sets, where some diseases are naturally rare and have very limited observations (fewer than 2 out of 10,000). The baselines included commonly used text classification models such as sparse and dense bag-of-words models, long short-term memory networks, and their variants that leveraged external knowledge. To obtain learning curves, we incremented the number of training examples per disease from small to large, and measured the classification performance in macro-averaged F1 score.
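
The learning-curve procedure is simple to reproduce: fit a model on progressively larger per-class training subsets and score each fit with macro-averaged F1 on a fixed test set. A minimal sketch, assuming scikit-learn's metrics API; the subset sizes and the fit_predict callback are illustrative choices, not the paper's exact protocol:

```python
# Learning-curve sketch (assumed scikit-learn metrics API); sizes are placeholders.
import numpy as np
from sklearn.metrics import f1_score

def learning_curve(fit_predict, X_train, y_train, X_test, y_test,
                   sizes=(1, 2, 4, 8, 16, 32)):
    """Macro-F1 of a model at increasing numbers of training examples per class.

    fit_predict(train_texts, train_labels, test_texts) -> predicted labels;
    any model (BERT fine-tuning, bag-of-words, LSTM) can be plugged in.
    """
    y_train = np.asarray(y_train)
    scores = {}
    for n in sizes:
        # Keep up to n examples per class, mirroring the per-disease increments.
        idx = np.concatenate([np.where(y_train == c)[0][:n]
                              for c in np.unique(y_train)])
        pred = fit_predict([X_train[i] for i in idx], y_train[idx], X_test)
        scores[n] = f1_score(y_test, pred, average="macro")
    return scores
```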

RESULTS

On the task of classifying all diseases, the learning curves of BERT were consistently above those of all baselines, significantly outperforming them across the spectrum of training data sizes. But in extreme situations where only one or two training documents per disease were available, BERT was outperformed by linear classifiers with carefully engineered bag-of-words features.

CONCLUSION

As long as the number of training documents is not extremely small, fine-tuning a pretrained BERT model is a highly effective approach to health NLP tasks like disease classification. However, in extreme cases where each class has only one or two training documents and no more will become available, simple linear models using bag-of-words features should be considered.
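
For the extreme regime the conclusion describes, a bag-of-words linear baseline takes only a few lines. A minimal sketch, again assuming scikit-learn; TF-IDF unigrams and bigrams with logistic regression stand in for the "carefully engineered" features, which the abstract does not specify:

```python
# Bag-of-words linear baseline sketch (assumed scikit-learn API).
# TF-IDF + logistic regression is a stand-in for the paper's engineered features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def bow_fit_predict(train_texts, train_labels, test_texts):
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression(max_iter=1000))
    model.fit(train_texts, train_labels)   # works even with 1-2 docs per class
    return model.predict(test_texts)

# Plugged into the learning-curve sketch above:
# scores = learning_curve(bow_fit_predict, X_train, y_train, X_test, y_test)
```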


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/57e3/8981604/cfa99ce029c5/12911_2022_1829_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/57e3/8981604/b6d78a6e387a/12911_2022_1829_Fig2_HTML.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/57e3/8981604/406a4182cb97/12911_2022_1829_Fig3_HTML.jpg

Similar articles

1. When BERT meets Bilbo: a learning curve analysis of pretrained language model on disease classification.
BMC Med Inform Decis Mak. 2022 Apr 5;21(Suppl 9):377. doi: 10.1186/s12911-022-01829-2.
2. Extracting comprehensive clinical information for breast cancer using deep learning methods.
Int J Med Inform. 2019 Dec;132:103985. doi: 10.1016/j.ijmedinf.2019.103985. Epub 2019 Oct 2.
3. Few-Shot Learning for Clinical Natural Language Processing Using Siamese Neural Networks: Algorithm Development and Validation Study.
JMIR AI. 2023 May 4;2:e44293. doi: 10.2196/44293.
4. Classifying social determinants of health from unstructured electronic health records using deep learning-based natural language processing.
J Biomed Inform. 2022 Mar;127:103984. doi: 10.1016/j.jbi.2021.103984. Epub 2022 Jan 7.
5. Oversampling effect in pretraining for bidirectional encoder representations from transformers (BERT) to localize medical BERT and enhance biomedical BERT.
Artif Intell Med. 2024 Jul;153:102889. doi: 10.1016/j.artmed.2024.102889. Epub 2024 May 5.
6. PharmBERT: a domain-specific BERT model for drug labels.
Brief Bioinform. 2023 Jul 20;24(4). doi: 10.1093/bib/bbad226.
7. RadBERT: Adapting Transformer-based Language Models to Radiology.
Radiol Artif Intell. 2022 Jun 15;4(4):e210258. doi: 10.1148/ryai.210258. eCollection 2022 Jul.
8. BioBERT and Similar Approaches for Relation Extraction.
Methods Mol Biol. 2022;2496:221-235. doi: 10.1007/978-1-0716-2305-3_12.
9. Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction.
NPJ Digit Med. 2021 May 20;4(1):86. doi: 10.1038/s41746-021-00455-y.
10. BERT-based Transfer Learning in Sentence-level Anatomic Classification of Free-Text Radiology Reports.
Radiol Artif Intell. 2023 Feb 15;5(2):e220097. doi: 10.1148/ryai.220097. eCollection 2023 Mar.

Cited by

1. Enhancing Automatic PT Tagging for MEDLINE Citations Using Transformer-Based Models.
ArXiv. 2025 Jun 3:arXiv:2506.03321v1.
2. An explainable RoBERTa approach to analyzing panic and anxiety sentiment in oral health education YouTube comments.
Sci Rep. 2025 Jul 1;15(1):21737. doi: 10.1038/s41598-025-06560-2.
3. Using large language models for extracting stressful life events to assess their impact on preventive colon cancer screening adherence.
BMC Public Health. 2025 Jan 2;25(1):12. doi: 10.1186/s12889-024-21123-2.
4. Estimating Patient Satisfaction Through a Language Processing Model: Model Development and Evaluation.
JMIR Form Res. 2023 Sep 14;7:e48534. doi: 10.2196/48534.
5. Improving patient self-description in Chinese online consultation using contextual prompts.
BMC Med Inform Decis Mak. 2022 Jun 27;22(1):170. doi: 10.1186/s12911-022-01909-3.

References

1. GRAM: Graph-based Attention Model for Healthcare Representation Learning.
KDD. 2017 Aug;2017:787-795. doi: 10.1145/3097983.3098126.
2. Distant supervision for medical concept normalization.
J Biomed Inform. 2020 Sep;109:103522. doi: 10.1016/j.jbi.2020.103522. Epub 2020 Aug 9.
3. Chinese clinical named entity recognition with variant neural structures based on BERT methods.
J Biomed Inform. 2020 Jul;107:103422. doi: 10.1016/j.jbi.2020.103422. Epub 2020 Apr 28.
4. Clinical Text Data in Machine Learning: Systematic Review.
JMIR Med Inform. 2020 Mar 31;8(3):e17984. doi: 10.2196/17984.
5. Improving rare disease classification using imperfect knowledge graph.
BMC Med Inform Decis Mak. 2019 Dec 5;19(Suppl 5):238. doi: 10.1186/s12911-019-0938-1.
6. Deep learning in clinical natural language processing: a methodical review.
J Am Med Inform Assoc. 2020 Mar 1;27(3):457-470. doi: 10.1093/jamia/ocz200.
7. Using clinical reasoning ontologies to make smarter clinical decision support systems: a systematic review and data synthesis.
J Am Med Inform Assoc. 2020 Jan 1;27(1):159-174. doi: 10.1093/jamia/ocz169.
8. Traditional Chinese medicine clinical records classification with BERT and domain specific corpora.
J Am Med Inform Assoc. 2019 Dec 1;26(12):1632-1636. doi: 10.1093/jamia/ocz164.
9. BioBERT: a pre-trained biomedical language representation model for biomedical text mining.
Bioinformatics. 2020 Feb 15;36(4):1234-1240. doi: 10.1093/bioinformatics/btz682.
10. Clinical text classification with rule-based features and knowledge-guided convolutional neural networks.
BMC Med Inform Decis Mak. 2019 Apr 4;19(Suppl 3):71. doi: 10.1186/s12911-019-0781-4.