Haider Zafaryab, Rahman Md Hafizur, Devabhaktuni Vijay, Moeykens Shane, Chakraborty Prabuddha
Department of Electrical and Computer Engineering (ECE), University of Maine, Orono, ME, USA.
Department of Electrical and Computer Engineering (ECE), Illinois State University, Normal, IL, USA.
Sci Rep. 2025 Mar 17;15(1):9177. doi: 10.1038/s41598-025-92889-7.
Large language models (LLMs) have demonstrated impressive capabilities in natural language processing and understanding. LLMs are being rapidly adopted in major industry sectors, including mobile computing, healthcare, finance, government, and education, driven by technology giants such as NVIDIA, OpenAI, Microsoft, Apple, Meta, Google, Broadcom, AMD, and IBM. However, due to the emerging nature of this technology, many security and privacy challenges remain unresolved and must be tackled before LLMs are rolled out to critical applications (e.g., healthcare, legal). In this article, we focus on the Reinforcement Learning via Human Feedback (RLHF) process that is widely used to train LLMs, giving them the human-like feel most applications value. RLHF employs human experts to generate feedback on an LLM's query-response pairs and uses this feedback to retrain (fine-tune) the model. However, RLHF can also expose the LLM to malicious feedback generated by one or more individuals in the process, leading to degraded performance and harmful responses. Most state-of-the-art (SOTA) solutions to this problem rely on a KL-divergence-based brute-force update-rejection approach that can render the whole RLHF process useless (model quality is not improved) in the presence of malicious entities. We propose the COnsensus-Based RewArd framework (COBRA), a consensus-based technique that effectively negates the malicious noise generated by a segment of the RLHF human-expert pool, leading to improved LLM training performance in a mixed-trust scenario. We evaluated COBRA on two separate LLM use cases, sentiment analysis and a conversational task, and experimented with a range of LLMs (e.g., GPT-2 XL, 1.5B parameters). COBRA outperformed the standard unprotected reward generation scheme by [Formula: see text] for the generative conversational task and by [Formula: see text] for the sentiment analysis task. We also quantitatively compared COBRA with Coste et al. and observed state-of-the-art performance, particularly when a lower number of reward models is used ([Formula: see text] increased reward accuracy at [Formula: see text]).
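To make the two mechanisms mentioned in the abstract concrete, the sketch below illustrates (a) the standard KL-penalized reward shaping used in RLHF fine-tuning and (b) a consensus-style aggregation over multiple annotator or reward-model scores that discards outlier (potentially malicious) feedback before a reward is formed. This is a minimal illustrative sketch only; the function names, the trimmed-mean consensus rule, and the penalty coefficient are assumptions for exposition, not the authors' published implementation of COBRA.

```python
# Illustrative sketch: trimmed-mean consensus over annotator rewards plus
# a standard KL-penalty reward shaping term. All names and the specific
# consensus rule are assumptions, not the COBRA implementation itself.
import numpy as np

def kl_penalized_reward(reward, logprob_policy, logprob_reference, beta=0.1):
    """RLHF-style reward shaping: subtract a per-sample KL estimate so the
    fine-tuned policy stays close to the reference model."""
    kl_estimate = logprob_policy - logprob_reference
    return reward - beta * kl_estimate

def consensus_reward(annotator_scores, trim=0.2):
    """Aggregate scores from several annotators / reward models with a
    trimmed mean, so a minority of extreme (possibly malicious) scores is
    discarded before the consensus reward is formed."""
    scores = np.sort(np.asarray(annotator_scores, dtype=float))
    k = int(len(scores) * trim)  # number of extremes to drop on each side
    trimmed = scores[k:len(scores) - k] if len(scores) > 2 * k else scores
    return float(trimmed.mean())

if __name__ == "__main__":
    # Five honest annotators plus one adversarial annotator inverting its score.
    scores = [0.80, 0.75, 0.82, 0.78, 0.81, -1.00]
    naive = float(np.mean(scores))                  # dragged down by the attacker
    consensus = consensus_reward(scores, trim=0.2)  # close to the honest scores
    print(f"naive mean reward:   {naive:.3f}")
    print(f"consensus reward:    {consensus:.3f}")
    # The consensus reward would then feed the KL-penalized training signal.
    shaped = kl_penalized_reward(consensus, logprob_policy=-1.2,
                                 logprob_reference=-1.5, beta=0.1)
    print(f"KL-shaped reward:    {shaped:.3f}")
```

In this toy example the naive mean is pulled toward the single adversarial score, while the trimmed consensus stays near the honest annotators, which is the general failure mode and remedy the abstract describes for mixed-trust RLHF feedback pools.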