Rehana Hasin, Zheng Jie, Yeh Leo, Bansal Benu, Çam Nur Bengisu, Jemiyo Christianah, McGregor Brett, Özgür Arzucan, He Yongqun, Hur Junguk
Department of Biomedical Sciences, University of North Dakota School of Medicine and Health Sciences, Grand Forks, North Dakota, 58202, USA.
School of Electrical Engineering & Computer Science, University of North Dakota, Grand Forks, North Dakota, 58202, USA.
ArXiv. 2025 Feb 12:arXiv:2502.09659v1.
An adjuvant is a chemical incorporated into vaccines that enhances their efficacy by improving the immune response. Identifying adjuvant names in cancer vaccine studies is essential for furthering research and enhancing immunotherapies. However, manual curation of the constantly expanding biomedical literature poses significant challenges. This study explores the automated recognition of vaccine adjuvant names using state-of-the-art Large Language Models (LLMs), specifically Generative Pretrained Transformers (GPT) and Large Language Model Meta AI (Llama).
We used two datasets: 97 clinical trial records from AdjuvareDB and 290 PubMed abstracts annotated with the Vaccine Adjuvant Compendium (VAC). Two LLMs, GPT-4o and Llama 3.2, were employed in zero-shot and few-shot learning paradigms with up to four examples per prompt. Prompts explicitly targeted adjuvant names and tested the impact of added contextual information such as substances or interventions. Outputs underwent automated and manual validation for accuracy and consistency.
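The zero-shot and few-shot setup described above can be sketched roughly as follows. This is a minimal illustration, not the prompts used in the study: the function name `build_prompt`, the instruction wording, the `Interventions:` field label, and the example texts are all hypothetical. Only the cap of four examples per prompt and the optional intervention context come from the description above.

```python
from typing import Optional, Sequence, Tuple

# One worked example: (input text, adjuvant names expected in the answer).
Example = Tuple[str, Sequence[str]]

# Hypothetical instruction wording; the study's actual prompts may differ.
INSTRUCTION = (
    "List every vaccine adjuvant name mentioned in the text. "
    "Answer with a semicolon-separated list, or NONE if there is none."
)


def build_prompt(
    text: str,
    examples: Sequence[Example] = (),
    interventions: Optional[Sequence[str]] = None,
    max_shots: int = 4,
) -> str:
    """Assemble a zero-shot (no examples) or few-shot prompt as plain text."""
    if len(examples) > max_shots:
        raise ValueError(f"at most {max_shots} examples per prompt")

    parts = [INSTRUCTION]
    # Each example is shown in the same layout as the final query.
    for ex_text, ex_names in examples:
        answer = "; ".join(ex_names) if ex_names else "NONE"
        parts.append(f"Text: {ex_text}\nAdjuvants: {answer}")

    query = f"Text: {text}"
    if interventions:
        # Optional contextual enrichment, e.g. a trial's intervention list.
        query += "\nInterventions: " + "; ".join(interventions)
    parts.append(query + "\nAdjuvants:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    shots = [("Peptide vaccine given with Montanide ISA-51.", ["Montanide ISA-51"])]
    print(build_prompt(
        "Patients received a tumor lysate vaccine plus poly-ICLC.",
        examples=shots,
        interventions=["tumor lysate vaccine", "poly-ICLC"],
    ))
```

Calling `build_prompt` with no examples gives the zero-shot variant, and leaving out `interventions` gives the version without contextual enrichment, so the same function covers each condition being compared.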
GPT-4o consistently attained 100% Precision across all settings, while also showing notable gains in Recall and F1-score, particularly when intervention information was included in the prompt. On the VAC dataset, GPT-4o achieved a maximum F1-score of 77.32% with interventions, surpassing Llama-3.2-3B by approximately 2%. On the AdjuvareDB dataset, GPT-4o reached an F1-score of 81.67% for three-shot prompting with interventions, surpassing Llama-3.2-3B's maximum F1-score of 65.62%. These results highlight the critical role of contextual information in enhancing model performance, with GPT-4o demonstrating a superior ability to leverage this enrichment.
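To make the relationship between the reported metrics concrete, the sketch below computes Precision, Recall, and F1 from sets of gold and predicted adjuvant names. The matching scheme is an assumption: the abstract does not say how predictions were matched to annotations, so this uses micro-averaged exact matching after case folding. The helper names `normalize` and `precision_recall_f1` and the toy records are hypothetical and not taken from either dataset.

```python
from typing import Iterable, List, Set, Tuple


def normalize(name: str) -> str:
    """Case-fold and collapse whitespace (an assumed, simple normalization)."""
    return " ".join(name.lower().split())


def precision_recall_f1(
    gold: List[Iterable[str]], pred: List[Iterable[str]]
) -> Tuple[float, float, float]:
    """Micro-averaged scores over per-document sets of adjuvant names."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same documents")

    tp = fp = fn = 0
    for gold_names, pred_names in zip(gold, pred):
        g: Set[str] = {normalize(n) for n in gold_names}
        p: Set[str] = {normalize(n) for n in pred_names}
        tp += len(g & p)   # predicted and annotated
        fp += len(p - g)   # predicted but not annotated
        fn += len(g - p)   # annotated but missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy records for illustration only.
    gold = [{"Montanide ISA-51", "GM-CSF"}, {"poly-ICLC"}]
    pred = [{"montanide isa-51"}, {"poly-ICLC"}]
    p, r, f1 = precision_recall_f1(gold, pred)
    print(f"P={p:.2%} R={r:.2%} F1={f1:.2%}")
```

In the toy example every predicted name is correct but one annotated name is missed, so Precision is 100% while Recall is two thirds and F1 is 80%. This is the same pattern as the results above, where perfect Precision coexists with F1-scores below 100% because Recall is the limiting factor.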
Our findings demonstrate that LLMs can accurately identify adjuvant names, including rare and novel naming variants. This study highlights the potential of LLMs to support cancer vaccine development by efficiently extracting information from clinical trial data. Future work aims to broaden the framework to encompass a wider array of biomedical literature and to improve model generalizability across various vaccines and adjuvants.
Source code is available at https://github.com/hurlab/Vaccine-Adjuvant-LLM.