Retrieval augmentation of large language models for lay language generation.

Affiliations

Biomedical and Health Informatics, University of Washington, United States of America.

Paul G. Allen School of Computer Science, University of Washington, United States of America.

Publication information

J Biomed Inform. 2024 Jan;149:104580. doi: 10.1016/j.jbi.2023.104580. Epub 2023 Dec 30.

Abstract

The complex linguistic structures and specialized terminology of expert-authored content limit the accessibility of biomedical literature to the general public. Automated methods have the potential to render this literature more interpretable to readers with different educational backgrounds. Prior work has framed such lay language generation as a summarization or simplification task. However, adapting biomedical text for the lay public includes the additional and distinct task of background explanation: adding external content in the form of definitions, motivation, or examples to enhance comprehensibility. This task is especially challenging because the source document may not include the required background knowledge. Furthermore, background explanation capabilities have yet to be formally evaluated, and little is known about how best to enhance them. To address this problem, we introduce Retrieval-Augmented Lay Language (RALL) generation, which intuitively fits the need for external knowledge beyond that in expert-authored source documents. In addition, we introduce CELLS, the largest (63k pairs) and broadest-ranging (12 journals) parallel corpus for lay language generation. To evaluate RALL, we augmented state-of-the-art text generation models with information retrieval of either term definitions from the UMLS and Wikipedia, or embeddings of explanations from Wikipedia documents. Of these, embedding-based RALL models improved summary quality and simplicity while maintaining factual correctness, suggesting that Wikipedia is a helpful source for background explanation in this context. We also evaluated the ability of an open-source large language model (Llama 2) and a closed-source large language model (GPT-4) at background explanation, with and without retrieval augmentation. Results indicate that these LLMs can generate simplified content, but that summary quality is not ideal. Taken together, this work presents the first comprehensive study of background explanation for lay language generation, paving the way for disseminating scientific knowledge to a broader audience. Our code and data are publicly available at: https://github.com/LinguisticAnomalies/pls_retrieval.
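To make the retrieval-augmentation idea concrete, below is a minimal Python sketch of the embedding-based variant described in the abstract: background passages are embedded offline, the passages most similar to a given source abstract are retrieved, and the retrieved text is prepended to the input of a generation model. This is not the paper's implementation; the encoder (all-MiniLM-L6-v2), the summarization model (facebook/bart-large-cnn), the toy passage list, and the prompt format are all illustrative assumptions. See the linked repository for the authors' actual code.

```python
# Minimal sketch of embedding-based retrieval-augmented lay language
# generation (RALL). Not the paper's pipeline: the models, the toy
# "Wikipedia" passage list, and top_k are illustrative choices only.
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

# Toy stand-in for an index of Wikipedia explanations; a real system
# would embed and index full Wikipedia documents offline.
passages = [
    "Hypertension, or high blood pressure, is a condition in which the "
    "force of blood against artery walls is persistently too high.",
    "A randomized controlled trial is a study that assigns participants "
    "to groups by chance in order to compare treatments fairly.",
    "The placebo effect is an improvement in symptoms caused by a "
    "treatment that contains no active ingredient.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
passage_emb = encoder.encode(passages, normalize_embeddings=True)
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def retrieve(source_text: str, top_k: int = 2) -> list[str]:
    """Return the top_k background passages most similar to the input."""
    query_emb = encoder.encode([source_text], normalize_embeddings=True)
    scores = passage_emb @ query_emb[0]  # cosine similarity (unit vectors)
    best = np.argsort(scores)[::-1][:top_k]
    return [passages[i] for i in best]

def generate_lay_summary(source_text: str) -> str:
    """Prepend retrieved background to the input, then generate."""
    background = " ".join(retrieve(source_text))
    augmented = f"Background: {background} Abstract: {source_text}"
    result = summarizer(augmented, max_length=120, min_length=30)
    return result[0]["summary_text"]

if __name__ == "__main__":
    abstract = (
        "In a randomized controlled trial of 240 adults with hypertension, "
        "the intervention group showed a significant reduction in systolic "
        "blood pressure relative to placebo."
    )
    print(generate_lay_summary(abstract))
```

In the paper's other variant, retrieval instead targets term definitions from the UMLS and Wikipedia; per the abstract, it was the embedding-based approach sketched above that improved summary quality and simplicity while maintaining factual correctness.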

