AI in Home Care-Evaluation of Large Language Models for Future Training of Informal Caregivers: Observational Comparative Case Study.

Author information

Pérez-Esteve Clara, Guilabert Mercedes, Matarredona Valerie, Srulovici Einav, Tella Susanna, Strametz Reinhard, Mira José Joaquín

Affiliations

Fundación para el Fomento de la Investigación Sanitaria y Biomédica de la Comunitat Valenciana, Centro de Salud Hospital-Plá, Alicante, Spain.

Health Psychology Department, Miguel Hernandez University, Elche, Spain.

Publication information

J Med Internet Res. 2025 Apr 28;27:e70703. doi: 10.2196/70703.

Abstract

BACKGROUND

Population aging is an accomplishment for society, but it also poses significant challenges for governments, health care systems, and caregivers. Elevated rates of functional limitations among older adults, primarily caused by chronic conditions, necessitate adequate and safe care, including in home settings. Traditionally, informal caregiver training has relied on verbal and written instructions. The advent of digital resources has added videos and interactive platforms, offering more accessible and effective training. Large language models (LLMs) have emerged as potential tools for personalized information delivery. While LLMs can mimic clinical reasoning and support decision-making, their potential to serve as alternatives to evidence-based professional instruction remains unexplored.

OBJECTIVE

We aimed to evaluate the appropriateness of home care instructions generated by LLMs (including GPTs) in comparison to a professional gold standard. We also sought to identify the specific domains where LLMs show the most promise and those where improvements are necessary to optimize their reliability for caregiver training.

METHODS

An observational, comparative case study evaluated 3 LLMs (GPT-3.5, GPT-4o, and Microsoft Copilot) in 10 home care scenarios. A rubric assessed the models against a reference standard (gold standard) created by health care professionals. Independent reviewers evaluated variables including specificity, clarity, and self-efficacy. In addition to comparing each LLM to the gold standard, the models were compared against each other across all study domains to identify relative strengths and weaknesses. Statistical analyses compared LLM performance to the gold standard to ensure consistency and validity, and analyzed differences between LLMs across all evaluated domains.

RESULTS

The study revealed that while no LLM achieved the precision of the professional gold standard, GPT-4o outperformed GPT-3.5 and Copilot in specificity (4.6 vs 3.7 and 3.6), clarity (4.8 vs 4.1 and 3.9), and self-efficacy (4.6 vs 3.8 and 3.4). However, the models exhibited significant limitations, with GPT-4o and Copilot omitting relevant details in 60% (6/10) of the cases, and GPT-3.5 doing so in 80% (8/10). When compared to the gold standard, only 10% (2/20) of GPT-4o responses were rated as equally specific, 20% (4/20) included comparable practical advice, and just 5% (1/20) provided a justification as detailed as professional guidance. Furthermore, error frequency did not differ significantly across models (P=.65), though Copilot had the highest rate of incorrect information (20%, 2/10 vs 10%, 1/10 for GPT-4o and 0%, 0/10 for GPT-3.5).
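
The proportions and score gaps reported above can be rechecked with a few lines of arithmetic. The following is a minimal Python sketch, not the authors' analysis: the counts and mean scores are copied from the Results, the denominator of 10 scenarios per model is taken from the Methods, and the names `percent`, `rate_table`, and `score_gaps` are hypothetical helpers introduced only for illustration. It does not attempt to reproduce the reported P value, because the abstract does not name the statistical test used.

```python
# Counts as reported in the Results; each model was run on 10 scenarios (Methods).
SCENARIOS = 10
omissions = {"GPT-3.5": 8, "GPT-4o": 6, "Copilot": 6}
errors = {"GPT-3.5": 0, "GPT-4o": 1, "Copilot": 2}

# Mean rubric scores as reported in the Results.
scores = {
    "specificity": {"GPT-4o": 4.6, "GPT-3.5": 3.7, "Copilot": 3.6},
    "clarity": {"GPT-4o": 4.8, "GPT-3.5": 4.1, "Copilot": 3.9},
    "self-efficacy": {"GPT-4o": 4.6, "GPT-3.5": 3.8, "Copilot": 3.4},
}


def percent(count, total):
    """Return count/total as a whole-number percentage."""
    return round(100 * count / total)


def rate_table(counts, total=SCENARIOS):
    """Map each model to the percentage of scenarios affected."""
    return {model: percent(n, total) for model, n in counts.items()}


def score_gaps(scores, reference="GPT-4o"):
    """For each domain, compute how far the reference model's mean is above each other model's."""
    gaps = {}
    for domain, by_model in scores.items():
        ref = by_model[reference]
        gaps[domain] = {
            model: round(ref - value, 1)
            for model, value in by_model.items()
            if model != reference
        }
    return gaps


if __name__ == "__main__":
    print("Omission rate (%):", rate_table(omissions))
    print("Error rate (%):", rate_table(errors))
    print("GPT-4o lead by domain:", score_gaps(scores))
```

Run as a script, this prints the omission and error rates per model and the margin by which GPT-4o's mean score exceeds the other two models in each domain, which makes it easy to confirm that every percentage matches its fraction.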

CONCLUSIONS

LLMs, particularly the subscription-based GPT-4o, show potential as tools for training informal caregivers by providing tailored guidance and reducing errors. Although they do not yet match the quality of professional instruction, these models offer a flexible and accessible alternative that could enhance home safety and care quality. Further research is necessary to address limitations and optimize their performance. Future implementation of LLMs may alleviate health care system burdens by reducing common caregiver errors.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/27bf/12070015/41c5146bcf43/jmir_v27i1e70703_fig1.jpg