Han Sifei, Shi Lingyun, Tsui Fuchiang Rich
Department of Biomedical and Health Informatics, Tsui Laboratory, Children's Hospital of Philadelphia, Philadelphia, PA, United States of America.
Department of Anesthesiology and Critical Care, Children's Hospital of Philadelphia, Philadelphia, PA, United States of America.
PLoS One. 2025 Jan 10;20(1):e0317042. doi: 10.1371/journal.pone.0317042. eCollection 2025.
Semantic text understanding holds significant importance in natural language processing (NLP). Numerous datasets, such as Quora Question Pairs (QQP), have been devised for this purpose. In our previous study, we developed a Siamese Convolutional Neural Network (S-CNN) that achieved an F1 score of 82.02% (95% C.I.: 81.83%-82.20%). Given the growing attention toward large language models (LLMs) such as ChatGPT, we aimed to explore their effectiveness in text similarity tasks. In this research, we leveraged 5 pretrained LLMs, applied several fine-tuning approaches (prompt engineering, n-shot learning, and supervised fine-tuning with low-rank adaptation [LoRA]), and compared their performance using the F1 score. To ensure a fair comparison, we followed our previous study's design and dataset, employing 10-fold cross-validation for supervised model training and evaluation. Additionally, we conducted a secondary study in which we introduced a recent, larger LLM with 70B parameters and compared it with the 7B model on the GLUE benchmark; both models were fine-tuned on the same corpus. The fine-tuned LLaMA model with 7B parameters (qLLaMA_LoRA-7B), trained on a 100,000-pair QQP corpus, yielded the best results, achieving an F1 score of 84.9% (95% C.I.: 84.13%-85.67%), which outperformed Alpaca_LoRA-65B (fine-tuned from LLaMA-65B) (F1: 64.98% [64.72%-65.25%]; P<0.01) and improved on our previously published best model, S-CNN, by roughly 3 percentage points. The fine-tuned LLaMA3.1-70B (qLLaMA3.1_LoRA-70B) with 70B parameters (F1: 74.4%) outperformed qLLaMA_LoRA-7B (F1: 71.9%) on the GLUE benchmark. The study demonstrated an effective LLM fine-tuning framework, which highlights the importance of fine-tuning LLMs for improved performance. Our task-specific supervised fine-tuning improved LLM performance compared to larger pretrained models with or without n-shot learning; moreover, fine-tuning a larger LLM further improved performance compared to fine-tuning a smaller LLM.
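The core mechanism behind the LoRA fine-tuning used above can be illustrated in a toy form: a frozen weight matrix W is adapted by adding a scaled low-rank product (alpha / r) * B @ A, so only the small matrices A and B are trained rather than all of W. The sketch below is a minimal pure-Python illustration with hypothetical toy dimensions, not the authors' training code (which used pretrained LLaMA weights).

```python
# Toy sketch of a LoRA-style forward pass. W is frozen; the trainable
# low-rank matrices A (r x d_in) and B (d_out x r) add a correction
# scaled by alpha / r. Dimensions here are illustrative only.

def matmul(M, N):
    """Multiply two matrices given as lists of rows."""
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def lora_forward(W, A, B, x, alpha, r):
    """Compute y = W x + (alpha / r) * B (A x) for a vector x."""
    col = [[v] for v in x]                 # x as a column vector
    base = matmul(W, col)                  # frozen pretrained path
    delta = matmul(B, matmul(A, col))      # trainable low-rank path
    scale = alpha / r
    return [base[i][0] + scale * delta[i][0] for i in range(len(W))]

# Example: d_out=2, d_in=3, rank r=1 -> only 5 trainable values
# (A has 3, B has 2) instead of the 6 entries of W.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 1.0, 1.0]]          # r x d_in
B = [[0.5], [0.5]]             # d_out x r
y = lora_forward(W, A, B, [1.0, 2.0, 3.0], alpha=2, r=1)
print(y)  # → [7.0, 8.0]: base [1.0, 2.0] plus scaled low-rank correction
```

At scale, the same idea lets a 7B- or 70B-parameter model be adapted by training only a small fraction of its parameters, which is what makes the supervised fine-tuning in this study tractable.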
Our LLM-based fine-tuning framework may improve a variety of document similarity tasks, such as matching resumes with job descriptions, recommending subject-matter experts, or identifying potential reviewers for grant proposals or manuscript submissions.
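The application pattern described above reduces to scoring document pairs for similarity. As a stand-in for an LLM-based similarity model, the sketch below uses a simple bag-of-words cosine similarity; the resume and job-description strings are invented examples, not data from the study.

```python
# Illustrative pairwise document scoring (not the paper's model):
# represent each document as a word-count vector and compare with
# cosine similarity; higher scores indicate closer matches.
from collections import Counter
from math import sqrt

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity of bag-of-words vectors for two documents."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

resume = "python machine learning nlp"
job = "nlp engineer with python experience"
score = cosine_similarity(resume, job)
print(round(score, 3))  # → 0.447
```

In practice, a fine-tuned LLM such as qLLaMA_LoRA-7B would replace this scoring function, but the ranking-by-similarity workflow around it stays the same.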