Majdik Zoltan P, Graham S Scott, Shiva Edward Jade C, Rodriguez Sabrina N, Karnes Martha S, Jensen Jared T, Barbour Joshua B, Rousseau Justin F
Department of Communication, North Dakota State University, Fargo, ND, United States.
Department of Rhetoric & Writing, The University of Texas at Austin, Austin, TX, United States.
JMIR AI. 2024 May 16;3:e52095. doi: 10.2196/52095.
Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking.
This study aims to evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named entity recognition (NER) for a custom data set of conflicts of interest disclosure statements.
A random sample of 200 disclosure statements was prepared for annotation. All "PERSON" and "ORG" entities were identified by each of 2 raters, and once appropriate interrater agreement was established, the annotators independently annotated an additional 290 disclosure statements. From the 490 annotated documents, 2500 stratified random samples in different size ranges were drawn. The 2500 training set subsamples were used to fine-tune a selection of language models across 2 model architectures (Bidirectional Encoder Representations from Transformers [BERT] and Generative Pre-trained Transformer [GPT]) for improved NER, and multiple regression was used to assess the relationship between sample size (number of sentences), entity density (entities per sentence [EPS]), and trained model performance (F-score). Additionally, single-predictor threshold regression models were used to evaluate the possibility of diminishing marginal returns from increased sample size or entity density.
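As an illustration of the regression analysis described above, the following sketch (synthetic data and hypothetical column names, not the authors' code) computes entity density (EPS) for a set of fine-tuning runs and fits a two-predictor ordinary least squares model of F-score on sample size and EPS:

# Hedged sketch: two-predictor regression of NER F-score on training
# sample size (sentences) and entity density (entities per sentence, EPS).
# Data and column names are illustrative; the paper's exact pipeline may differ.
import pandas as pd
import statsmodels.formula.api as smf

# Each row summarizes one fine-tuning run on one stratified subsample.
runs = pd.DataFrame({
    "sentences": [250, 400, 600, 800, 1000, 1200],   # subsample size in sentences
    "entities":  [310, 540, 830, 1090, 1400, 1650],  # annotated PERSON/ORG spans
    "f_score":   [0.81, 0.86, 0.90, 0.91, 0.92, 0.92],
})
runs["eps"] = runs["entities"] / runs["sentences"]    # entity density

# Two-predictor multiple linear regression: F-score ~ sentences + EPS.
model = smf.ols("f_score ~ sentences + eps", data=runs).fit()
print(model.summary())   # coefficients, multiple R, per-predictor P values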
Fine-tuned models ranged in topline NER performance from F-score=0.79 to F-score=0.96 across architectures. Two-predictor multiple linear regression models were statistically significant, with multiple R ranging from 0.6057 to 0.7896 (all P<.001). EPS and the number of sentences were significant predictors of F-scores in all cases (P<.001), except for the GPT-2_large model, where EPS was not a significant predictor (P=.184). Model thresholds indicate points of diminishing marginal return from increases in training data set size, measured in sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large. Likewise, the threshold regression models indicate a diminishing marginal return for EPS, with point estimates between 1.36 and 1.38.
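The single-predictor threshold models resemble segmented (breakpoint) regression; one minimal way to estimate such a threshold is sketched below, using synthetic data and a simple grid search over candidate breakpoints (not necessarily the authors' estimation procedure):

# Hedged sketch: single-predictor threshold (segmented) regression on
# sample size, locating a candidate point of diminishing marginal return.
import numpy as np
import statsmodels.api as sm

sentences = np.array([100, 200, 300, 400, 500, 600, 800, 1000, 1200])
f_score   = np.array([0.74, 0.82, 0.87, 0.90, 0.91, 0.915, 0.92, 0.92, 0.921])

def breakpoint_ssr(bp):
    # Piecewise-linear model whose slope changes at the breakpoint bp.
    hinge = np.maximum(sentences - bp, 0)
    X = sm.add_constant(np.column_stack([sentences, hinge]))
    return sm.OLS(f_score, X).fit().ssr

# Choose the breakpoint that minimizes the residual sum of squares.
candidates = np.arange(150, 1150, 10)
best_bp = min(candidates, key=breakpoint_ssr)
print(f"Estimated threshold: ~{best_bp} sentences")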
Relatively modest sample sizes can be used to fine-tune LLMs for NER tasks applied to biomedical text, and training data entity density should approximate the entity density expected in production data. Training data quality and a model architecture's intended use (text generation vs text processing or classification) may be as important as, or more important than, training data volume and model parameter size.