Suppr 超能文献




Sample Size Considerations for Fine-Tuning Large Language Models for Named Entity Recognition Tasks: Methodological Study.

Author Information

Majdik Zoltan P, Graham S Scott, Shiva Edward Jade C, Rodriguez Sabrina N, Karnes Martha S, Jensen Jared T, Barbour Joshua B, Rousseau Justin F

Affiliations

Department of Communication, North Dakota State University, Fargo, ND, United States.

Department of Rhetoric & Writing, The University of Texas at Austin, Austin, TX, United States.

Publication Information

JMIR AI. 2024 May 16;3:e52095. doi: 10.2196/52095.

DOI: 10.2196/52095
PMID: 38875593
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11140272/
Abstract

BACKGROUND

Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking.

OBJECTIVE

This study aims to evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named entity recognition (NER) for a custom data set of conflicts of interest disclosure statements.

METHODS

A random sample of 200 disclosure statements was prepared for annotation. All "PERSON" and "ORG" entities were identified by each of the 2 raters, and once appropriate agreement was established, the annotators independently annotated an additional 290 disclosure statements. From the 490 annotated documents, 2500 stratified random samples in different size ranges were drawn. The 2500 training set subsamples were used to fine-tune a selection of language models across 2 model architectures (Bidirectional Encoder Representations from Transformers [BERT] and Generative Pre-trained Transformer [GPT]) for improved NER, and multiple regression was used to assess the relationship between sample size (sentences), entity density (entities per sentence [EPS]), and trained model performance (F-score). Additionally, single-predictor threshold regression models were used to evaluate the possibility of diminishing marginal returns from increased sample size or entity density.
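The sampling bookkeeping described above — measuring entity density (EPS) and drawing stratified random training subsets in different size ranges — can be sketched as follows. This is a minimal illustration on toy data, not the authors' code; the size ranges, field names, and entity counts are assumptions.

```python
import random
from dataclasses import dataclass

@dataclass
class Sentence:
    text: str
    n_entities: int  # count of annotated PERSON + ORG spans

def entities_per_sentence(sample):
    """Entity density (EPS): total entities divided by number of sentences."""
    return sum(s.n_entities for s in sample) / len(sample)

def stratified_subsamples(sentences, size_ranges, draws_per_range, seed=0):
    """Draw random training subsets whose sizes fall in the given ranges."""
    rng = random.Random(seed)
    subsamples = []
    for lo, hi in size_ranges:
        for _ in range(draws_per_range):
            n = rng.randint(lo, min(hi, len(sentences)))
            subsamples.append(rng.sample(sentences, n))
    return subsamples

# Toy corpus standing in for sentences from the 490 annotated statements
corpus = [Sentence(f"sent {i}", n_entities=i % 3) for i in range(1000)]
subs = stratified_subsamples(corpus, [(50, 100), (400, 600)], draws_per_range=5)
print(len(subs), round(entities_per_sentence(corpus), 2))
```

Each subsample's size and EPS would then serve as the two predictors of the fine-tuned model's F-score in the regression analysis.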

RESULTS

Fine-tuned models ranged in topline NER performance from F-score=0.79 to F-score=0.96 across architectures. Two-predictor multiple linear regression models were statistically significant, with multiple R ranging from 0.6057 to 0.7896 (all P<.001). EPS and the number of sentences were significant predictors of F-scores in all cases (P<.001), except for the GPT-2_large model, where EPS was not a significant predictor (P=.184). Model thresholds indicate points of diminishing marginal return from increased training data set sample size as measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large. Likewise, the threshold regression models indicate a diminishing marginal return for EPS, with point estimates between 1.36 and 1.38.
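A single-predictor threshold regression of the kind used here can be illustrated with a piecewise-linear model whose breakpoint is chosen by grid search over candidate thresholds. This is a sketch on synthetic data, not the authors' estimation code; the grid-search selection and the synthetic F-score curve are assumptions.

```python
import numpy as np

def fit_threshold_regression(x, y):
    """Fit y = b0 + b1*x + b2*max(0, x - t), choosing the breakpoint t
    that minimizes the sum of squared errors; t estimates the point of
    diminishing marginal returns."""
    best_t, best_sse = None, np.inf
    for t in np.unique(x)[1:-1]:  # candidate breakpoints at observed x values
        X = np.column_stack([np.ones_like(x), x, np.maximum(0.0, x - t)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        sse = float(np.sum((y - X @ beta) ** 2))
        if sse < best_sse:
            best_t, best_sse = float(t), sse
    return best_t

# Synthetic F-scores: rise with sample size, then flatten near 450 sentences
rng = np.random.default_rng(42)
x = rng.uniform(50, 1500, 400)           # training set sizes (sentences)
true_t = 450.0
y = (0.75 + 4e-4 * x
     - 3.5e-4 * np.maximum(0.0, x - true_t)
     + rng.normal(0, 0.003, x.size))
t_hat = fit_threshold_regression(x, y)
print(t_hat)
```

On data generated from the piecewise model itself, the recovered breakpoint lands near the true value, mirroring how the paper reports point estimates such as 439 sentences for RoBERTa_large.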

CONCLUSIONS

Relatively modest sample sizes can be used to fine-tune LLMs for NER tasks applied to biomedical text, and training data entity density should representatively approximate entity density in production data. Training data quality and a model architecture's intended use (text generation vs text processing or classification) may be as important as, or more important than, training data volume and model parameter size.


Similar Articles

1
Sample Size Considerations for Fine-Tuning Large Language Models for Named Entity Recognition Tasks: Methodological Study.
JMIR AI. 2024 May 16;3:e52095. doi: 10.2196/52095.
2
Evaluating Medical Entity Recognition in Health Care: Entity Model Quantitative Study.
JMIR Med Inform. 2024 Oct 17;12:e59782. doi: 10.2196/59782.
3
Extracting comprehensive clinical information for breast cancer using deep learning methods.
Int J Med Inform. 2019 Dec;132:103985. doi: 10.1016/j.ijmedinf.2019.103985. Epub 2019 Oct 2.
4
A Fine-Tuned Bidirectional Encoder Representations From Transformers Model for Food Named-Entity Recognition: Algorithm Development and Validation.
J Med Internet Res. 2021 Aug 9;23(8):e28229. doi: 10.2196/28229.
5
Advancing entity recognition in biomedicine via instruction tuning of large language models.
Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae163.
6
Using Large Language Models to Annotate Complex Cases of Social Determinants of Health in Longitudinal Clinical Records.
medRxiv. 2024 Apr 27:2024.04.25.24306380. doi: 10.1101/2024.04.25.24306380.
7
Transformers-sklearn: a toolkit for medical language understanding with transformer-based models.
BMC Med Inform Decis Mak. 2021 Jul 30;21(Suppl 2):90. doi: 10.1186/s12911-021-01459-0.
8
Improving large language models for clinical named entity recognition via prompt engineering.
J Am Med Inform Assoc. 2024 Sep 1;31(9):1812-1820. doi: 10.1093/jamia/ocad259.
9
Multi-Label Classification in Patient-Doctor Dialogues With the RoBERTa-WWM-ext + CNN (Robustly Optimized Bidirectional Encoder Representations From Transformers Pretraining Approach With Whole Word Masking Extended Combining a Convolutional Neural Network) Model: Named Entity Study.
JMIR Med Inform. 2022 Apr 21;10(4):e35606. doi: 10.2196/35606.
10
Fine-Tuning Bidirectional Encoder Representations From Transformers (BERT)-Based Models on Large-Scale Electronic Health Record Notes: An Empirical Study.
JMIR Med Inform. 2019 Sep 12;7(3):e14830. doi: 10.2196/14830.

Cited By

1
AI in conjunctivitis research: assessing ChatGPT and DeepSeek for etiology, intervention, and citation integrity via hallucination rate analysis.
Front Artif Intell. 2025 Aug 20;8:1579375. doi: 10.3389/frai.2025.1579375. eCollection 2025.
2
Precision in Parsing: Evaluation of an Open-Source Named Entity Recognizer (NER) in Veterinary Oncology.
Vet Comp Oncol. 2025 Mar;23(1):102-108. doi: 10.1111/vco.13035. Epub 2024 Dec 23.
3
DeepEnhancerPPO: An Interpretable Deep Learning Approach for Enhancer Classification.
Int J Mol Sci. 2024 Dec 2;25(23):12942. doi: 10.3390/ijms252312942.
4
Use of artificial intelligence algorithms to analyse systemic sclerosis-interstitial lung disease imaging features.
Rheumatol Int. 2024 Oct;44(10):2027-2041. doi: 10.1007/s00296-024-05681-7. Epub 2024 Aug 29.

References

1
Large language models in health care: Development, applications, and challenges.
Health Care Sci. 2023 Jul 24;2(4):255-263. doi: 10.1002/hcs2.61. eCollection 2023 Aug.
2
Lessons learned from translating AI from development to deployment in healthcare.
Nat Med. 2023 Jun;29(6):1304-1306. doi: 10.1038/s41591-023-02293-9.
3
Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models.
PLOS Digit Health. 2023 Feb 9;2(2):e0000198. doi: 10.1371/journal.pdig.0000198. eCollection 2023 Feb.
4
Evidence for stratified conflicts of interest policies in research contexts: a methodological review.
BMJ Open. 2022 Sep 19;12(9):e063501. doi: 10.1136/bmjopen-2022-063501.
5
Tasks as needs: reframing the paradigm of clinical natural language processing research for real-world decision support.
J Am Med Inform Assoc. 2022 Sep 12;29(10):1810-1817. doi: 10.1093/jamia/ocac121.
6
Associations Between Aggregate NLP-Extracted Conflicts of Interest and Adverse Events by Drug Product.
Stud Health Technol Inform. 2022 Jun 6;290:405-409. doi: 10.3233/SHTI220106.
7
A systematic review on natural language processing systems for eligibility prescreening in clinical research.
J Am Med Inform Assoc. 2021 Dec 28;29(1):197-206. doi: 10.1093/jamia/ocab228.
8
Benchmarking Modern Named Entity Recognition Techniques for Free-text Health Record Deidentification.
AMIA Jt Summits Transl Sci Proc. 2021 May 17;2021:102-111. eCollection 2021.
9
Structuring clinical text with AI: Old versus new natural language processing techniques evaluated on eight common cardiovascular diseases.
Patterns (N Y). 2021 Jun 17;2(7):100289. doi: 10.1016/j.patter.2021.100289. eCollection 2021 Jul 9.
10
Extracting Drug Names and Associated Attributes From Discharge Summaries: Text Mining Study.
JMIR Med Inform. 2021 May 5;9(5):e24678. doi: 10.2196/24678.