
Using large language models for extracting and pre-annotating texts on mental health from noisy data in a low-resource language.

Authors

Koltcov Sergei, Surkov Anton, Koltsova Olessia, Ignatenko Vera

Affiliation

Laboratory for Social & Cognitive Informatics, National Research University Higher School of Economics, St. Petersburg, Russia.

Publication

PeerJ Comput Sci. 2024 Nov 28;10:e2395. doi: 10.7717/peerj-cs.2395. eCollection 2024.

Abstract

Recent advancements in large language models (LLMs) have opened new possibilities for developing conversational agents (CAs) in various subfields of mental healthcare. However, this progress is hindered by limited access to high-quality training data, often due to privacy concerns and high annotation costs for low-resource languages. A potential solution is to create human-AI annotation systems that utilize extensive public domain user-to-user and user-to-professional discussions on social media. These discussions, however, are extremely noisy, necessitating the adaptation of LLMs for fully automatic cleaning and pre-classification to reduce human annotation effort. To date, research on LLM-based annotation in the mental health domain is extremely scarce. In this article, we explore the potential of zero-shot classification using four LLMs to select and pre-classify texts into topics representing psychiatric disorders, in order to facilitate the future development of CAs for disorder-specific counseling. We use 64,404 Russian-language texts from online discussion threads labeled with the seven most commonly discussed disorders: depression, neurosis, paranoia, anxiety disorder, bipolar disorder, obsessive-compulsive disorder, and borderline personality disorder. Our research shows that while preliminary data filtering using zero-shot technology slightly improves classification, LLM fine-tuning makes a far larger contribution to its quality. Both standard and natural language inference (NLI) modes of fine-tuning increase classification accuracy by more than three times compared to non-fine-tuned training with preliminarily filtered data. Although NLI fine-tuning achieves slightly higher accuracy (0.64) than the standard approach, it is six times slower, indicating a need for further experimentation with NLI hypothesis engineering. Additionally, we demonstrate that lemmatization does not affect classification quality and that multilingual models using texts in their original language perform slightly better than English-only models using automatically translated texts. Finally, we introduce our dataset and model as the first openly available Russian-language resource for developing conversational agents in the domain of mental health counseling.
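The NLI-style zero-shot classification described in the abstract can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the hypothesis template, the English label phrasings, and the `toy_score` function are assumptions made here for demonstration (in practice the scorer would be an actual NLI model, e.g. a multilingual transformer producing entailment probabilities for premise-hypothesis pairs).

```python
# Sketch of NLI-style zero-shot classification: each candidate disorder
# label is converted into a natural-language hypothesis, and the label
# whose hypothesis the model most strongly entails is selected.

LABELS = [
    "depression", "neurosis", "paranoia", "anxiety disorder",
    "bipolar disorder", "obsessive-compulsive disorder",
    "borderline personality disorder",
]

def build_hypotheses(labels, template="This text is about {}."):
    """Turn each label into an NLI hypothesis sentence."""
    return {label: template.format(label) for label in labels}

def classify(text, labels, entailment_score, template="This text is about {}."):
    """Pick the label whose hypothesis gets the highest entailment score.

    `entailment_score(premise, hypothesis)` is assumed to return a float;
    in practice it would call an NLI model rather than a heuristic.
    """
    hypotheses = build_hypotheses(labels, template)
    scores = {lab: entailment_score(text, hyp) for lab, hyp in hypotheses.items()}
    return max(scores, key=scores.get), scores

def toy_score(premise, hypothesis):
    """Placeholder scorer for illustration: keyword overlap, NOT real NLI."""
    premise_words = set(premise.lower().split())
    hyp_words = set(hypothesis.lower().rstrip(".").split())
    return len(premise_words & hyp_words)

label, _ = classify("I think I have depression and cannot sleep", LABELS, toy_score)
# label == "depression"
```

Because the label set is fixed only at inference time (as hypothesis text), this setup needs no labeled training data, which is what makes it attractive for pre-annotation in low-resource settings; the abstract's finding is that fine-tuning the underlying model nevertheless improves accuracy far more than zero-shot filtering alone.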



Cited By

1. Artificial Intelligence in Obsessive-Compulsive Disorder: A Systematic Review. Curr Treat Options Psychiatry. 2025;12(1):23. doi: 10.1007/s40501-025-00359-8. Epub 2025 Jun 14.

