

Scalable evaluation framework for retrieval-augmented generation in tobacco research using large language models.

Author Information

Elmitwalli Sherif, Mehegan John, Braznell Sophie, Gallagher Allen

Affiliations

Tobacco Control Research Group, Department for Health, University of Bath, Bath, UK.

Publication Information

Sci Rep. 2025 Jul 2;15(1):22760. doi: 10.1038/s41598-025-05726-2.

Abstract

Retrieval-augmented generation (RAG) systems show promise in specialized knowledge domains, but the tobacco research field lacks standardized assessment frameworks for comparing different large language models (LLMs). This gap impacts public health decisions that require accurate, domain-specific information retrieval from complex tobacco industry documentation. We aimed to develop and validate a tobacco domain-specific evaluation framework for assessing various LLMs in RAG systems, combining automated metrics with expert validation. Using a Goal-Question-Metric paradigm, we evaluated two distinct LLM architectures in RAG configurations: Mixtral 8×7B and Llama 3.1 70B. The framework incorporated automated assessment via GPT-4o alongside validation by three tobacco research specialists. A domain-specific dataset of 20 curated queries assessed model performance across nine metrics, including accuracy, domain specificity, completeness, and clarity. Our framework successfully differentiated performance between models, with Mixtral 8×7B significantly outperforming Llama 3.1 70B in accuracy (8.8/10 vs. 7.55/10, p < 0.05) and domain specificity (8.65/10 vs. 7.6/10, p < 0.05). Case analysis revealed Mixtral's superior handling of industry-specific terminology and contextual relationships. Hyperparameter optimization further improved Mixtral's completeness from 7.1/10 to 7.9/10, demonstrating the framework's utility for model refinement. This study establishes a robust framework specifically for evaluating LLMs in tobacco research RAG systems, with demonstrated potential for extension to other specialized domains. The significant performance differences between models highlight the importance of domain-specific evaluation for public health applications. Future research should extend this framework to broader document corpora and additional LLMs, including commercial models.
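The evaluation pipeline the abstract describes — an LLM judge scoring each RAG answer on rubric metrics, followed by a significance test between two models — can be sketched roughly as below. This is an illustrative outline, not the authors' implementation: the metric names are a subset of the paper's nine, `judge_score` is a deterministic stand-in for the GPT-4o judge, and the paired t statistic is one plausible choice for the "p < 0.05" comparison of per-query scores.

```python
import math
from statistics import mean, stdev

# Subset of the paper's nine metrics (illustrative; full list not shown here).
METRICS = ["accuracy", "domain_specificity", "completeness", "clarity"]

def judge_score(answer: str, metric: str) -> float:
    """Stand-in for the GPT-4o judge: returns a 0-10 rubric score.

    In the real framework this would prompt an LLM with the query, the
    retrieved context, the candidate answer, and the metric definition.
    Here a toy heuristic keeps the sketch runnable end to end.
    """
    return 5.0 + (len(answer.split()) % 6) * 0.5

def evaluate(answers: list[str]) -> dict[str, float]:
    """Mean per-metric score over all answers (one answer per curated query)."""
    return {m: mean(judge_score(a, m) for a in answers) for m in METRICS}

def paired_t(scores_a: list[float], scores_b: list[float]) -> float:
    """Paired t statistic for per-query scores of two models on one metric."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
```

With per-query accuracy scores for two models, `paired_t` gives the test statistic to compare against the critical value for n-1 degrees of freedom; in practice one would use a statistics library (e.g. a paired t-test routine) rather than hand-rolling it.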


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/371c/12219056/1360701af65d/41598_2025_5726_Fig1_HTML.jpg
