Jiang Lan, Lan Mengfei, Menke Joe D, Vorland Colby J, Kilicoglu Halil
School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL, USA.
Indiana University, School of Public Health, Bloomington, IN, USA.
medRxiv. 2024 Apr 1:2024.03.31.24305138. doi: 10.1101/2024.03.31.24305138.
To develop text classification models for determining whether the checklist items in the CONSORT reporting guidelines are reported in randomized controlled trial publications.
Using a corpus annotated at the sentence level with 37 fine-grained CONSORT items, we trained several sentence classification models (PubMedBERT fine-tuning, BioGPT fine-tuning, and in-context learning with GPT-4) and compared their performance. To address the problem of the small training dataset, we used several data augmentation methods (EDA, UMLS-EDA, text generation and rephrasing with GPT-4) and assessed their impact on the fine-tuned PubMedBERT model. We also fine-tuned PubMedBERT models limited to checklist items associated with specific sections (e.g., Methods) to evaluate whether such models could improve performance over the single full model. We performed 5-fold cross-validation and report precision, recall, F score, and area under the curve (AUC).
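As a rough illustration of the fine-tuning setup described above, the sketch below shows multi-label sentence classification with a PubMedBERT checkpoint via the Hugging Face transformers library. The checkpoint name, the label indices in the toy example, and the single gradient step are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: fine-tuning PubMedBERT for multi-label CONSORT item
# classification. Checkpoint name, example label indices, and training
# details are assumptions for illustration only.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
NUM_LABELS = 37  # fine-grained CONSORT checklist items

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL,
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # sigmoid outputs + BCE loss
)

# Toy example: one sentence carrying two (hypothetical) gold item labels.
sentence = "Participants were randomly assigned using a computer-generated list."
labels = torch.zeros(NUM_LABELS)
labels[7] = labels[8] = 1.0  # hypothetical indices for randomization items

inputs = tokenizer(sentence, truncation=True, return_tensors="pt")
outputs = model(**inputs, labels=labels.unsqueeze(0))
outputs.loss.backward()  # one gradient step; a real run would use a full Trainer loop

probs = torch.sigmoid(outputs.logits)  # per-item probabilities, e.g., thresholded at 0.5
```

Because a sentence can report more than one checklist item, the task is framed here as multi-label rather than multi-class classification; each item gets an independent sigmoid score.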
The fine-tuned PubMedBERT model that takes as input the sentence and surrounding sentence representations and uses section headers yielded the best overall performance (0.71 micro-F, 0.64 macro-F). Data augmentation had a limited positive effect, with UMLS-EDA yielding slightly better results than GPT-4-based augmentation. BioGPT fine-tuning and GPT-4 in-context learning exhibited suboptimal results. The Methods-specific model yielded higher performance for methodology items; other section-specific models did not have a significant impact.
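One simple way to supply the classifier with surrounding-sentence context and section headers is to concatenate them into the model input, as sketched below. The best model reported above combines separate representations of the neighboring sentences, so this flat encoding and the encode_with_context helper are an illustrative approximation, not the paper's architecture.

```python
# Sketch: pairing a section header with the previous, target, and next
# sentence in a single tokenized input. Checkpoint name and helper are
# assumptions; the paper's best model encodes context differently.
from transformers import AutoTokenizer

MODEL = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def encode_with_context(sentences, idx, section_header):
    """Build a header/context input pair for the sentence at position idx."""
    prev_s = sentences[idx - 1] if idx > 0 else ""
    next_s = sentences[idx + 1] if idx + 1 < len(sentences) else ""
    context = " ".join([prev_s, sentences[idx], next_s]).strip()
    # Header in the first segment, sentence plus neighbors in the second.
    return tokenizer(section_header, context, truncation=True, return_tensors="pt")

sentences = [
    "Eligible adults were recruited from three clinics.",
    "Participants were randomly assigned using a computer-generated list.",
    "Allocation was concealed in sealed opaque envelopes.",
]
enc = encode_with_context(sentences, 1, "Methods: Randomisation")
```

On the reported metrics: micro-F aggregates over all sentence-item decisions, whereas macro-F averages per-item F scores, so rare checklist items weigh more heavily in the macro figure.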
Most CONSORT checklist items can be recognized reasonably well with the fine-tuned PubMedBERT model, but there is room for improvement. Improved models could underpin journal editorial workflows and CONSORT adherence checks and could help authors improve the reporting quality and completeness of their manuscripts.