Xuhai Xu, Bingsheng Yao, Yuanzhe Dong, Saadia Gabriel, Hong Yu, James Hendler, Marzyeh Ghassemi, Anind K. Dey, Dakuo Wang
Massachusetts Institute of Technology & University of Washington, USA.
Rensselaer Polytechnic Institute, USA.
Proc ACM Interact Mob Wearable Ubiquitous Technol. 2024 Mar;8(1). doi: 10.1145/3643540. Epub 2024 Mar 6.
Advances in large language models (LLMs) have empowered a variety of applications. However, a significant research gap remains in understanding and enhancing the capabilities of LLMs in the field of mental health. In this work, we present a comprehensive evaluation of multiple LLMs (Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4) on various mental health prediction tasks over online text data. We conduct a broad range of experiments covering zero-shot prompting, few-shot prompting, and instruction fine-tuning. The results indicate promising yet limited performance of LLMs with zero-shot and few-shot prompt designs for mental health tasks. More importantly, our experiments show that instruction fine-tuning can significantly boost the performance of LLMs on all tasks simultaneously. Our best fine-tuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 (25 and 15 times bigger, respectively) by 10.9% in balanced accuracy, and the best of GPT-4 (250 and 150 times bigger) by 4.8%. They further perform on par with the state-of-the-art task-specific language model. We also conduct an exploratory case study of LLMs' capability on mental health reasoning tasks, illustrating the promising capability of certain models such as GPT-4. We summarize our findings into a set of action guidelines on potential methods to enhance LLMs' capability for mental health tasks. At the same time, we emphasize important limitations that must be addressed before these models are deployable in real-world mental health settings, such as known racial and gender biases. We highlight the important ethical risks accompanying this line of research.
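To illustrate the difference between the zero-shot and few-shot prompt designs the abstract contrasts, here is a minimal hypothetical sketch. The task wording, label set, and example posts are assumptions for illustration only, not the paper's actual prompts or datasets.

```python
def build_prompt(post, examples=None):
    """Build a binary mental-health classification prompt.

    post: the online text to classify (illustrative).
    examples: optional list of (text, label) pairs for few-shot
        prompting; None yields a zero-shot prompt.
    """
    # Hypothetical task instruction; the paper's prompts may differ.
    instruction = (
        "Decide whether the author of the following post shows signs "
        "of depression. Answer 'yes' or 'no'.\n\n"
    )
    # In the few-shot setting, labeled demonstrations precede the query.
    shots = ""
    if examples:
        shots = "".join(
            f"Post: {text}\nAnswer: {label}\n\n"
            for text, label in examples
        )
    return instruction + shots + f"Post: {post}\nAnswer:"

zero_shot = build_prompt("I can't sleep and nothing feels worth doing.")
few_shot = build_prompt(
    "I can't sleep and nothing feels worth doing.",
    examples=[("Had a great day hiking with friends!", "no")],
)
```

Instruction fine-tuning, by contrast, updates the model's weights on many such instruction-response pairs across tasks, which is what the abstract reports as the largest source of improvement.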