
Large Language Models and Empathy: Systematic Review.

Authors

Sorin Vera, Brin Dana, Barash Yiftach, Konen Eli, Charney Alexander, Nadkarni Girish, Klang Eyal

Affiliations

Department of Radiology, Mayo Clinic, Rochester, MN, United States.

Department of Diagnostic Imaging, Sheba Medical Center, Ramat Gan, Israel.

Publication

J Med Internet Res. 2024 Dec 11;26:e52597. doi: 10.2196/52597.

Abstract

BACKGROUND

Empathy, a fundamental aspect of human interaction, is characterized as the ability to experience another being's emotions within oneself. In health care, empathy is fundamental to the interaction between health care professionals and patients. It is considered a uniquely human quality that large language models (LLMs) are believed to lack.

OBJECTIVE

We aimed to review the literature on the capacity of LLMs in demonstrating empathy.

METHODS

We conducted a literature search on MEDLINE, Google Scholar, PsyArXiv, medRxiv, and arXiv between December 2022 and February 2024. We included English-language full-length publications that evaluated empathy in LLMs' outputs. We excluded papers evaluating other topics related to emotional intelligence that were not specifically empathy. We summarized the included studies' metadata and results, including the LLMs used, their performance in empathy tasks, and the models' limitations.

RESULTS

A total of 12 studies published in 2023 met the inclusion criteria. ChatGPT-3.5 (OpenAI) was evaluated in all studies, with 6 studies comparing it with other LLMs such as GPT-4, LLaMA (Meta), and fine-tuned chatbots. Seven studies focused on empathy within a medical context. The studies reported that LLMs exhibit elements of empathy, including emotion recognition and emotional support in diverse contexts. Evaluation metrics included automatic metrics, such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and Bilingual Evaluation Understudy (BLEU), as well as subjective human evaluation. Some studies compared performance on empathy with humans, while others compared different models with one another. In some cases, LLMs were observed to outperform humans in empathy-related tasks. For example, ChatGPT-3.5 was evaluated for its responses to patients' questions from social media, where ChatGPT's responses were preferred over those of humans in 78.6% of cases. Other studies used subjective scores assigned by readers. One study reported a mean empathy score of 1.84-1.9 (scale 0-2) for their fine-tuned LLM, while a different study evaluating ChatGPT-based chatbots reported a mean human rating of 3.43 out of 4 for empathetic responses. Other evaluations were based on the Levels of Emotional Awareness Scale, which was reported to be higher for ChatGPT-3.5 than for humans. Another study evaluated ChatGPT and GPT-4 on soft-skills questions in the United States Medical Licensing Examination, where GPT-4 answered 90% of questions correctly. Limitations were noted, including repetitive use of empathic phrases, difficulty following initial instructions, overly lengthy responses, sensitivity to prompts, and overall subjective evaluation metrics influenced by the evaluator's background.
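To illustrate the kind of automatic metric mentioned above, here is a minimal sketch of ROUGE-1 recall (unigram overlap with a reference response), implemented from scratch for clarity. This is only an illustrative toy, not the evaluation pipeline used by any of the reviewed studies; the example sentences are invented.

```python
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    """ROUGE-1 recall: fraction of reference unigrams also present in the candidate,
    counting each word at most as many times as it appears in the reference."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum(min(cand_counts[word], count) for word, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

# Hypothetical reference (human) and candidate (model) empathetic responses.
reference = "i understand how stressful this must feel for you"
candidate = "i understand this must feel very stressful for you"
score = rouge1_recall(reference, candidate)  # high overlap, close to 1.0
```

Note that such n-gram metrics reward surface overlap only, which is one reason the reviewed studies also relied on subjective human ratings of empathy.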

CONCLUSIONS

LLMs exhibit elements of cognitive empathy, recognizing emotions and providing emotionally supportive responses in various contexts. Since social skills are an integral part of intelligence, these advancements bring LLMs closer to human-like interactions and expand their potential use in applications requiring emotional intelligence. However, there remains room for improvement in both the performance of these models and the evaluation strategies used for assessing soft skills.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a7c5/11669866/91d5643eab50/jmir_v26i1e52597_fig1.jpg
