

Advancing health coaching: A comparative study of large language model and health coaches.

Affiliations

Lee Kong Chian School of Medicine, Nanyang Technological University, 11 Mandalay Rd, 308232, Singapore; School of Public Health, Imperial College London, 90 Wood Ln, London W12 0BZ, United Kingdom.

Lee Kong Chian School of Medicine, Nanyang Technological University, 11 Mandalay Rd, 308232, Singapore.

Publication Information

Artif Intell Med. 2024 Nov;157:103004. doi: 10.1016/j.artmed.2024.103004. Epub 2024 Oct 19.


DOI: 10.1016/j.artmed.2024.103004
PMID: 39454500
Abstract

OBJECTIVE: Recent advances in large language models (LLMs) offer opportunities to automate health coaching. With zero-shot learning ability, LLMs could revolutionize health coaching by providing better accessibility, scalability, and customization. The aim of this study is to compare the quality of responses to clients' sleep-related questions provided by health coaches and an LLM.

DESIGN, SETTING, AND PARTICIPANTS: From a de-identified dataset of coaching conversations from a pilot randomized controlled trial, we extracted 100 question-answer pairs comprising client questions and the corresponding health coach responses. These questions were entered into a retrieval-augmented generation (RAG)-enabled open-source LLM (LLaMa-2-7b-chat) to generate LLM responses. Of the 100 question-answer pairs, 90 were assigned to three groups of evaluators: experts, lay users, and GPT-4. Each group conducted two evaluation tasks: (Task 1) a single-response quality assessment spanning five criteria (accuracy, readability, helpfulness, empathy, and likelihood of harm), each rated on a five-point Likert scale, and (Task 2) a pairwise comparison to choose the superior response in each pair. A suite of inferential statistical methods, including paired and independent-sample t-tests, Pearson correlation, and chi-square tests, was used to address the study objective. Recognizing potential biases in human judgment, the remaining 10 question-answer pairs were used to assess inter-evaluator reliability among the human evaluators, quantified using the intraclass correlation coefficient and percentage-agreement metrics.

RESULTS: After exclusion of incomplete data, the analysis included 178 single-response evaluations (Task 1) and 83 pairwise comparisons (Task 2). Expert and GPT-4 assessments revealed no discernible disparities between health coach and LLM responses across the five metrics. In contrast, lay users deemed LLM responses significantly more helpful than those of human coaches (p < 0.05). LLM responses were preferred in the majority (62.25%, n = 155) of the 249 aggregate assessments, with all three evaluator groups favoring LLM over health coach inputs. While GPT-4 rated both health coach and LLM responses significantly higher than experts did in terms of readability, helpfulness, and empathy, its ratings on accuracy and likelihood of harm aligned with those of experts. Response length correlated positively with accuracy and empathy scores but negatively with readability across all evaluator groups. Expert and lay-user evaluators demonstrated moderate to high inter-evaluator reliability.

CONCLUSION: Our study showed encouraging findings, demonstrating that a RAG-enabled LLM performs comparably to health coaches in the domain tested. As an initial step toward more sophisticated, adaptive, round-the-clock automated health coaching systems, our findings call for more extensive evaluation to support development of a model that could in the future lead to clinical implementation.
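The inter-evaluator reliability check described in the abstract rests on two standard statistics: the intraclass correlation coefficient and percentage agreement. A minimal sketch of both is below; note that the function names are illustrative and the ICC(2,1) form (two-way random effects, absolute agreement, single rater) is an assumption, since the abstract does not state which ICC variant the authors used.

```python
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random-effects, absolute-agreement, single-rater.
    ratings: array of shape (n_subjects, k_raters)."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-subject means
    col_means = ratings.mean(axis=0)   # per-rater means
    # Two-way ANOVA sum-of-squares decomposition
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_total = ((ratings - grand) ** 2).sum()
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

def percent_agreement(ratings: np.ndarray) -> float:
    """Fraction of items on which every rater gave the identical score."""
    return float((ratings == ratings[:, [0]]).all(axis=1).mean())
```

In practice, each row would hold one of the 10 held-out question-answer pairs and each column one human evaluator's Likert score on a given criterion; the ICC captures graded closeness of scores, while percentage agreement only credits exact matches.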


Similar Articles

[1]
Advancing health coaching: A comparative study of large language model and health coaches.

Artif Intell Med. 2024-11

[2]
[Preliminary exploration of the applications of five large language models in the field of oral auxiliary diagnosis, treatment and health consultation].

Zhonghua Kou Qiang Yi Xue Za Zhi. 2025-7-30

[3]
Large Language Models and Empathy: Systematic Review.

J Med Internet Res. 2024-12-11

[4]
Assessing Retrieval-Augmented Large Language Model Performance in Emergency Department ICD-10-CM Coding Compared to Human Coders.

medRxiv. 2024-10-17

[5]
Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study.

J Med Internet Res. 2025-5-20

[6]
Evaluating and Enhancing Japanese Large Language Models for Genetic Counseling Support: Comparative Study of Domain Adaptation and the Development of an Expert-Evaluated Dataset.

JMIR Med Inform. 2025-1-16

[7]
Artificial intelligence-simplified information to advance reproductive genetic literacy and health equity.

Hum Reprod. 2025-7-22

[8]
Falls prevention interventions for community-dwelling older adults: systematic review and meta-analysis of benefits, harms, and patient values and preferences.

Syst Rev. 2024-11-26

[9]
Menstrual Health Education Using a Specialized Large Language Model in India: Development and Evaluation Study of MenstLLaMA.

J Med Internet Res. 2025-7-16

[10]
Signs and symptoms to determine if a patient presenting in primary care or hospital outpatient settings has COVID-19.

Cochrane Database Syst Rev. 2022-5-20

Cited By

[1]
Fine-tuning Large Language Models in Behavioral Psychology for Scalable Physical Activity Coaching.

medRxiv. 2025-2-21
