George Mason University, United States of America.
J Biomed Inform. 2024 Oct;158:104727. doi: 10.1016/j.jbi.2024.104727. Epub 2024 Sep 16.
The reading level of health educational materials significantly influences the understandability and accessibility of the information, particularly for minoritized populations. Many patient educational resources surpass widely accepted standards for reading level and complexity. There is a critical need for high-performing text simplification models for health information to enhance dissemination and literacy. This need is particularly acute in cancer education, where effective prevention and screening education can substantially reduce morbidity and mortality.
We introduce Simplified Digestive Cancer (SimpleDC), a parallel corpus of cancer education materials tailored for health text simplification research, comprising educational content from the American Cancer Society, Centers for Disease Control and Prevention, and National Cancer Institute. The corpus comprises 31 web pages with corresponding manually simplified versions, yielding 1183 annotated sentence pairs (361 train, 294 development, and 528 test). Using SimpleDC and the existing Med-EASi corpus, we explore Large Language Model (LLM)-based simplification methods, including fine-tuning, reinforcement learning (RL), reinforcement learning from human feedback (RLHF), domain adaptation, and prompt-based approaches. Our experiments cover Llama 2, Llama 3, and GPT-4. We introduce a novel RLHF reward function featuring a lightweight model adept at distinguishing between original and simplified texts, which enables training on unlabeled data.
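To make the reward-function idea concrete, the sketch below illustrates one plausible reading of it: a lightweight binary classifier scores how "simplified" a generated candidate looks, and that probability serves as the RL reward, so no reference simplifications are needed. The checkpoint name, labeling scheme, and surrounding PPO-style loop are assumptions for illustration, not the authors' exact implementation.

```python
# Hedged sketch of a classifier-based RLHF reward for text simplification.
# Assumption: a lightweight discriminator (e.g., DistilBERT) has been
# fine-tuned with label 0 = original text, label 1 = simplified text.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

REWARD_MODEL = "distilbert-base-uncased"  # placeholder checkpoint name
tokenizer = AutoTokenizer.from_pretrained(REWARD_MODEL)
classifier = AutoModelForSequenceClassification.from_pretrained(
    REWARD_MODEL, num_labels=2
)
classifier.eval()

@torch.no_grad()
def simplification_reward(generated_texts: list[str]) -> torch.Tensor:
    """Reward = P(label == 'simplified') for each generated candidate.

    Because the classifier needs only the generated text (no aligned
    reference), rewards can be computed on unlabeled source sentences.
    """
    batch = tokenizer(
        generated_texts, padding=True, truncation=True, return_tensors="pt"
    )
    logits = classifier(**batch).logits
    return torch.softmax(logits, dim=-1)[:, 1]  # probability of "simplified"

# Example: score candidate outputs from a policy model (e.g., Llama)
# before passing the rewards to a PPO-style policy update.
rewards = simplification_reward(
    ["Colon cancer screening finds growths before they turn into cancer."]
)
print(rewards)
```

In this reading, the reward model is cheap to train and to query at every RL step, which is what lets the policy be optimized on unlabeled in-domain text.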
Fine-tuned Llama models performed strongly across metrics. Our RLHF reward function outperformed existing RL text simplification reward functions. The results show that RL/RLHF can match the performance of fine-tuning and can further improve fine-tuned models. These methods also effectively adapt out-of-domain text simplification models to a target domain. The best-performing RL-enhanced Llama models outperformed GPT-4 in both automatic metrics and manual evaluation by subject matter experts.
The newly developed SimpleDC corpus will serve as a valuable asset to the research community, particularly in patient education simplification. The RL/RLHF methodologies presented herein enable effective training of simplification models on unlabeled text and the utilization of out-of-domain simplification corpora.