Optimized biomedical entity relation extraction method with data augmentation and classification using GPT-4 and Gemini.

Affiliation

Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, University Road, Tainan City 701, Taiwan.

Publication information

Database (Oxford). 2024 Oct 9;2024. doi: 10.1093/database/baae104.

DOI:10.1093/database/baae104

PMID:39383312

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11463225/

Abstract

Teams participating in BioCreative VIII Track 01 applied a variety of techniques in pursuit of high accuracy on biomedical relation tasks, yet overall performance in this area still has substantial room for improvement. Large language models offer a new opportunity to improve the performance of existing techniques in natural language processing tasks. This paper presents our improved method for relation extraction, which integrates two well-known large language models: Gemini and GPT-4. Our approach uses GPT-4 to generate augmented training data and then applies an ensemble learning technique that combines the outputs of diverse models to produce a more precise prediction. We then use Gemini responses as input to fine-tune the BioNLP-PubMed-Bert classification model, which improves precision, recall, and F1 scores on the same test dataset used in the challenge evaluation. Database URL: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-viii/track-1/.
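The abstract does not say which ensemble technique is used to combine the model outputs. As a minimal sketch of one common option, the Python snippet below applies per-instance majority voting to the relation labels predicted by several classifiers. The function name `majority_vote`, the tie-breaking rule, and the example labels are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists by majority voting (illustrative only).

    predictions: list of lists, one inner list of labels per model,
    all covering the same instances in the same order.
    Ties go to the label predicted by the earliest model in the list
    (an assumed rule, chosen here only to make the result deterministic).
    """
    if not predictions:
        raise ValueError("at least one model's predictions are required")
    n_instances = len(predictions[0])
    if any(len(p) != n_instances for p in predictions):
        raise ValueError("all models must predict the same number of instances")

    combined = []
    for labels in zip(*predictions):
        # Counter keeps insertion order, so labels from earlier models come first.
        counts = Counter(labels)
        # max() returns the first maximal key, which implements the tie rule.
        combined.append(max(counts, key=counts.get))
    return combined


# Hypothetical outputs of three relation classifiers on three entity pairs.
model_a = ["Association", "Positive_Correlation", "Bind"]
model_b = ["Association", "Negative_Correlation", "Bind"]
model_c = ["Positive_Correlation", "Negative_Correlation", "Association"]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

Weighted voting or averaging of class probabilities would be equally plausible readings of "ensemble learning"; hard voting is shown here only because it needs nothing beyond the predicted labels.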
