Soong David, Sridhar Sriram, Si Han, Wagner Jan-Samuel, Sá Ana Caroline Costa, Yu Christina Y, Karagoz Kubra, Guan Meijian, Kumar Sanyam, Hamadeh Hisham, Higgs Brandon W
Translational Data Sciences, Genmab, Princeton, New Jersey, United States of America.
Data Sciences and AI, Genmab, Princeton, New Jersey, United States of America.
PLOS Digit Health. 2024 Aug 21;3(8):e0000568. doi: 10.1371/journal.pdig.0000568. eCollection 2024 Aug.
Large language models (LLMs) have had a significant impact on the field of general artificial intelligence. General-purpose LLMs exhibit strong logic and reasoning skills and broad world knowledge but can generate misleading results when prompted on specific subject areas. LLMs trained with domain-specific knowledge can reduce the generation of misleading information (i.e., hallucinations) and improve precision in specialized contexts; training new LLMs on specific corpora, however, can be resource intensive. Here we explored a retrieval-augmented generation (RAG) model, which we tested on literature specific to a biomedical research area. OpenAI's GPT-3.5, GPT-4, Microsoft's Prometheus, and a custom RAG model were used to answer 19 questions on diffuse large B-cell lymphoma (DLBCL) disease biology and treatment. Eight independent reviewers assessed the LLM responses for accuracy, relevance, and readability, rating each category on a 3-point scale, and these scores were used to compare LLM performance. Performance varied across scoring categories. On accuracy and relevance, the RAG model outperformed the other models, with higher scores on average and the most top scores across questions; GPT-4 was closer to the RAG model on relevance than on accuracy. GPT-4 and GPT-3.5 had the highest readability scores but also produced more answers containing hallucinations than the other LLMs, owing to non-existent references and inaccurate responses to clinical questions. Our findings suggest that an oncology research-focused RAG model may outperform general-purpose LLMs in accuracy and relevance when answering subject-related questions. This framework can be tailored to Q&A in other subject areas. Further research will help clarify how LLM architectures, RAG methodologies, and prompting techniques affect question answering across different subject areas.
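The abstract does not include the RAG implementation itself; the following is a minimal sketch of the general retrieval-augmented pattern it describes (embed a domain corpus, retrieve the passages most similar to a question, and prompt an LLM to answer only from those passages). The corpus snippets, model names, and prompt wording here are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal RAG sketch, assuming the OpenAI Python SDK (>=1.0) and numpy.
# Corpus text, model names, and prompts are illustrative placeholders,
# not the study's implementation.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stand-in domain corpus: in the study this would be DLBCL literature,
# chunked into passages.
passages = [
    "DLBCL is the most common subtype of non-Hodgkin lymphoma ...",
    "R-CHOP remains a standard first-line regimen for DLBCL ...",
    "Cell-of-origin subtypes (GCB vs. ABC) carry prognostic value ...",
]

def embed(texts):
    """Embed a list of texts; returns an (n, d) array of vectors."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

corpus_vecs = embed(passages)

def answer(question, k=2):
    """Retrieve the top-k passages by cosine similarity, then answer from them."""
    q = embed([question])[0]
    sims = corpus_vecs @ q / (np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(q))
    context = "\n".join(passages[i] for i in np.argsort(sims)[::-1][:k])
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context; "
                        "say so if the context is insufficient."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

print(answer("What is the standard first-line treatment for DLBCL?"))
```

Grounding the prompt in retrieved passages is what distinguishes this pattern from querying the base model directly, and is the mechanism the abstract credits for fewer hallucinations and higher accuracy on domain questions.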
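Likewise, the per-model comparison the abstract reports (average score per category, and the number of questions on which a model earned the top score) can be aggregated as in this sketch. The ratings below are invented for illustration only; the study used eight reviewers, 19 questions, and four models.

```python
# Sketch of the score aggregation described in the abstract. All values
# are invented for illustration; they are not the study's data.
import pandas as pd

# One row per (model, question): mean accuracy rating across reviewers
# on the 3-point scale (hypothetical values, two questions shown).
ratings = pd.DataFrame({
    "model":    ["RAG", "GPT-4", "GPT-3.5", "Prometheus"] * 2,
    "question": [1, 1, 1, 1, 2, 2, 2, 2],
    "accuracy": [3.0, 2.5, 2.0, 1.5, 2.8, 2.9, 2.1, 1.8],
})

# Average accuracy per model across questions.
print(ratings.groupby("model")["accuracy"].mean())

# Number of questions on which each model had the (possibly tied) top score.
top = ratings.loc[ratings.groupby("question")["accuracy"].transform("max")
                  == ratings["accuracy"]]
print(top["model"].value_counts())
```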