A Reliable and Accessible Caregiving Language Model (CaLM) to Support Tools for Caregivers: Development and Evaluation Study.

Author Information

Parmanto Bambang, Aryoyudanta Bayu, Soekinto Timothius Wilbert, Setiawan I Made Agus, Wang Yuhan, Hu Haomin, Saptono Andi, Choi Yong Kyung

Affiliations

Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, United States.

Publication Information

JMIR Form Res. 2024 Jul 31;8:e54633. doi: 10.2196/54633.

Abstract

BACKGROUND

In the United States, 1 in 5 adults currently serves as a family caregiver for an individual with a serious illness or disability. Unlike professional caregivers, family caregivers often assume this role without formal preparation or training. Thus, there is an urgent need to enhance the capacity of family caregivers to provide quality care. Leveraging technology as an educational tool or an adjunct to care is a promising approach that has the potential to enhance the learning and caregiving capabilities of family caregivers. Large language models (LLMs) can potentially be used as a foundation technology for supporting caregivers. An LLM can be categorized as a foundation model (FM), which is a large-scale model trained on a broad data set that can be adapted to a range of different domain tasks. Despite their potential, FMs have the critical weakness of "hallucination," where the models generate information that can be misleading or inaccurate. Information reliability is essential when language models are deployed as front-line help tools for caregivers.

OBJECTIVE

This study aimed to (1) develop a reliable caregiving language model (CaLM) by using FMs and a caregiving knowledge base, (2) develop an accessible CaLM using a small FM that requires fewer computing resources, and (3) evaluate the model's performance compared with a large FM.

METHODS

We developed a CaLM using the retrieval-augmented generation (RAG) framework combined with FM fine-tuning, which improves the quality of FM answers by grounding the model on a caregiving knowledge base. The key components of the CaLM are the caregiving knowledge base, a fine-tuned FM, and a retriever module. We used 2 small FMs as candidates for the foundation of the CaLM (LLaMA 2 [Large Language Model Meta AI] and Falcon, each with 7 billion parameters) and adopted a large FM (GPT-3.5, with an estimated 175 billion parameters) as a benchmark. We developed the caregiving knowledge base by gathering various types of documents from the internet, focusing on caregivers of individuals with Alzheimer disease and related dementias. We evaluated the models using benchmark metrics commonly applied to language models, as well as their reliability in providing accurate references with their answers.
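The abstract describes the CaLM architecture only at a high level (a caregiving knowledge base, a retriever module, and a fine-tuned FM). The sketch below illustrates the general RAG pattern this refers to, not the authors' implementation: it assumes the "all-MiniLM-L6-v2" sentence-transformers model as the retriever, an in-memory cosine-similarity search instead of a dedicated vector store, toy knowledge-base entries with made-up example.org sources, and a placeholder llm_generate function standing in for the fine-tuned LLaMA 2 or Falcon model. The paper's actual retriever, document chunking, prompting, and fine-tuning details are not reproduced here.

```python
# Minimal RAG sketch over a toy caregiving knowledge base.
# Assumptions (not specified in the abstract): "all-MiniLM-L6-v2" as the
# retriever embedding model, in-memory cosine similarity instead of a
# dedicated vector store, and a placeholder `llm_generate` standing in
# for the fine-tuned 7B LLaMA 2 or Falcon model.
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy stand-ins for documents gathered from the internet on Alzheimer
# disease and related dementias (ADRD) caregiving.
KNOWLEDGE_BASE = [
    {"id": "kb-001", "source": "https://example.org/adrd-wandering",
     "text": "Wandering is common in Alzheimer disease; secure exits, "
             "use door alarms, and keep a recent photo of the person."},
    {"id": "kb-002", "source": "https://example.org/caregiver-respite",
     "text": "Respite care gives family caregivers short-term relief "
             "through in-home help, adult day programs, or facility stays."},
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = encoder.encode(
    [doc["text"] for doc in KNOWLEDGE_BASE], normalize_embeddings=True
)

def retrieve(question: str, k: int = 2) -> list[dict]:
    """Return the k knowledge-base entries most similar to the question."""
    query_vec = encoder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ query_vec  # cosine similarity (vectors are normalized)
    top_idx = np.argsort(-scores)[:k]
    return [KNOWLEDGE_BASE[i] for i in top_idx]

def build_prompt(question: str, passages: list[dict]) -> str:
    """Ground the FM on retrieved passages and ask it to cite its sources."""
    context = "\n".join(
        f"[{p['id']}] {p['text']} (source: {p['source']})" for p in passages
    )
    return (
        "Answer the caregiver's question using only the context below, "
        "and list the sources you relied on.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def llm_generate(prompt: str) -> str:
    """Placeholder for the fine-tuned FM (e.g., a 7B LLaMA 2 or Falcon)."""
    raise NotImplementedError("Plug in the fine-tuned model's generation call.")

if __name__ == "__main__":
    question = ("My mother with dementia keeps trying to leave the house "
                "at night. What can I do?")
    print(build_prompt(question, retrieve(question)))
```

Asking the model to list the sources it relied on mirrors the reliability criterion described in the abstract (returning accurate references with each answer); in a real deployment, the placeholder llm_generate would be replaced by the fine-tuned model's own generation call.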

RESULTS

The RAG framework improved the performance of all FMs used in this study across all measures. As expected, the large FM performed better than the small FMs across all metrics. Interestingly, the small fine-tuned FMs with RAG performed significantly better than GPT-3.5 across all metrics. The small fine-tuned LLaMA 2 FM also performed better than GPT-3.5 (even with RAG) in returning references with its answers.

CONCLUSIONS

The study shows that a reliable and accessible CaLM can be developed using small FMs with a knowledge base specific to the caregiving domain.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b633/11325100/a1ab629dad19/formative_v8i1e54633_fig1.jpg
