Pichowicz W, Kotas M, Piotrowski P
Faculty of Medicine, Wroclaw Medical University, Pasteura 1 Street, Wrocław, 50-367, Poland.
Laboratory of Immunopathology, Department of Experimental Therapy, Hirszfeld Institute of Immunology and Experimental Therapy, Polish Academy of Sciences, Weigla 12 Street, 53-114, Wroclaw, Poland.
Sci Rep. 2025 Aug 27;15(1):31652. doi: 10.1038/s41598-025-17242-4.
Advances in artificial intelligence (AI) technologies have sparked the rapid development of smartphone applications designed to help individuals experiencing mental health problems through an AI-powered chatbot agent. However, the safety of such agents when dealing with individuals experiencing a mental health crisis, including suicidal crisis, has not been evaluated. In this study, we assessed the ability of 29 AI-powered chatbot agents to respond to simulated suicidal-risk scenarios. Application repositories were searched, and app descriptions were screened to identify apps that claimed to be beneficial for individuals experiencing mental distress and that offered an AI-powered chatbot function. All agents were tested with a standardized set of prompts, based on the Columbia-Suicide Severity Rating Scale, designed to simulate increasing suicidal risk. We assessed the responses according to pre-defined criteria, based on the ability to provide emergency contact information and other factors. None of the tested agents satisfied our initial criteria for an adequate response; 51.72% satisfied the relaxed criteria for a marginal response, while 48.28% were deemed inadequate. Common errors included an inability to provide emergency contact information and a lack of contextual understanding. These findings raise concerns about the deployment of AI-powered chatbots in sensitive health contexts without proper clinical validation.