Instituto Federal de Educação, Ciência e Tecnologia do Ceará, Fortaleza, Brasil.
Universidade Federal do Delta do Parnaíba, Parnaíba, Brasil.
Cad Saude Publica. 2024 Nov 25;40(10):e00028824. doi: 10.1590/0102-311XEN028824. eCollection 2024.
Artificial intelligence can detect manifestations of suicidal ideation in texts. Studies demonstrate that BERT-based models achieve better performance in text classification problems. Large language models (LLMs) can answer free-text queries without task-specific training. This work compares the performance of three BERT variants and three LLMs (Google Bard, Microsoft Bing/GPT-4, and OpenAI ChatGPT-3.5) in identifying suicidal ideation in nonclinical texts written in Brazilian Portuguese. The dataset, labeled by psychologists, consisted of 2,691 sentences without suicidal ideation and 1,097 with suicidal ideation, of which 100 sentences were selected for testing. We applied data preprocessing techniques, hyperparameter optimization, and hold-out cross-validation to train and test the BERT models. To evaluate the LLMs, we used zero-shot prompting. Each test sentence was labeled as containing suicidal ideation or not, according to the chatbot's response. Bing/GPT-4 achieved the best performance, with 98% across all metrics. Fine-tuned BERT models outperformed the other LLMs: BERTimbau-Large performed best with 96% accuracy, followed by BERTimbau-Base with 94% and BERT-Multilingual with 87%. ChatGPT-3.5 achieved 81% accuracy, and Bard performed worst with 62%. The high recall of the models suggests a low rate of at-risk patients being misclassified, which is crucial to prevent missed interventions by professionals. However, despite their potential to support the detection of suicidal ideation, these models have not been validated in a clinical patient-monitoring setting. Therefore, caution is advised when using the evaluated models as tools to assist healthcare professionals in detecting suicidal ideation.
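To illustrate why recall is the metric tied to missed at-risk cases, the following is a minimal sketch of how accuracy, precision, recall, and F1 could be computed for a binary suicidal-ideation classifier. The function name `binary_metrics` and the ten toy labels are hypothetical and are not taken from the study's data or code; label 1 is assumed to mean "suicidal ideation present".

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # at-risk, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # at-risk, missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # not at-risk, flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # not at-risk, not flagged

    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall is the share of truly at-risk sentences that were detected;
    # 1 - recall is the share of at-risk sentences that were missed.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "missed": fn}


# Toy labels for illustration only (NOT the study's test set).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy case one of the four positive sentences is missed, so recall is 0.75 even though accuracy is 0.8, which shows how accuracy alone can hide missed at-risk cases.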