Srivastava Prateek, Tewari Ashish, Al-Riyami Arwa Z
Department of Transfusion Medicine, Medanta, Lucknow, India.
Department of Haematology, Sultan Qaboos University Hospital, University Medical City, Muscat, Oman.
Vox Sang. 2025 Mar 5. doi: 10.1111/vox.70009.
The recent rise of artificial intelligence (AI) chatbots has attracted many users worldwide. However, expert evaluation is essential before relying on them for transfusion medicine (TM)-related information. This study aimed to evaluate the performance of AI chatbots in terms of the accuracy, correctness, completeness and safety of their responses.
Six AI chatbots (ChatGPT 4, ChatGPT 4-o, Gemini Advanced, Copilot, Anthropic Claude 3.5 Sonnet, Meta AI) were tested using TM-related prompts at two time points, 30 days apart. Their responses were assessed by four TM experts. Evaluators' scores underwent inter-rater reliability testing. Responses from Day 30 were compared with those from Day 1 to evaluate consistency and potential evolution over time.
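The abstract does not specify the scoring rubric or the reliability statistic used. As a minimal illustrative sketch only, assuming the four experts assigned each response an ordinal category and that agreement was quantified with Fleiss' kappa (a common choice for multiple raters, not confirmed by the paper), the inter-rater reliability step could look like this:

```python
# Hypothetical sketch of the inter-rater reliability analysis. The 0-3 scale,
# the example scores and the choice of Fleiss' kappa are all assumptions;
# the study does not publish its rubric or analysis code.
import numpy as np

def fleiss_kappa(ratings: np.ndarray, n_categories: int) -> float:
    """Fleiss' kappa for `ratings` of shape (n_subjects, n_raters),
    where each entry is a category label in 0..n_categories-1."""
    n_subjects, n_raters = ratings.shape
    # counts[i, j] = number of raters who placed subject i in category j
    counts = np.zeros((n_subjects, n_categories))
    for j in range(n_categories):
        counts[:, j] = (ratings == j).sum(axis=1)
    # Observed per-subject agreement P_i and its mean
    p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement from the marginal category proportions
    p_j = counts.sum(axis=0) / (n_subjects * n_raters)
    p_e = (p_j ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical example: 4 experts rating 5 chatbot responses on a 0-3 scale
scores = np.array([
    [3, 3, 2, 3],
    [1, 1, 1, 2],
    [0, 1, 0, 0],
    [2, 2, 3, 2],
    [3, 3, 3, 3],
])
print(f"Fleiss' kappa = {fleiss_kappa(scores, n_categories=4):.3f}")
```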
All six chatbots exhibited some level of inconsistency and varying degrees of evolution in their responses over the 30 days. None provided entirely correct, complete or safe answers to all questions. Among the chatbots tested, ChatGPT 4-o and Anthropic Claude 3.5 Sonnet demonstrated the highest accuracy and consistency, while Microsoft Copilot and Google Gemini Advanced showed the greatest evolution in their responses. A limitation is that the 30-day interval may be too short to assess chatbot evolution precisely.
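The Day-1 versus Day-30 comparison in the study relied on expert review. As a purely hypothetical illustration of how such response drift could be screened automatically (not the authors' method), a lexical similarity measure such as Python's difflib.SequenceMatcher can rank which prompts changed most between the two runs:

```python
# Illustrative drift screen, not the study's methodology. Prompts and
# responses below are invented; a real run would feed in the stored
# Day-1 and Day-30 answers for each chatbot.
from difflib import SequenceMatcher

def drift_report(day1: dict[str, str], day30: dict[str, str]) -> list[tuple[str, float]]:
    """Return (prompt, similarity) pairs sorted least stable first.
    Assumes both runs cover the same prompts; 1.0 = lexically identical."""
    pairs = [
        (prompt, SequenceMatcher(None, day1[prompt], day30[prompt]).ratio())
        for prompt in day1
    ]
    return sorted(pairs, key=lambda p: p[1])

# Hypothetical prompt with truncated responses for one chatbot
day1 = {"Is RhD type relevant for plasma transfusion?":
        "Plasma compatibility is governed by ABO antibodies, not RhD..."}
day30 = {"Is RhD type relevant for plasma transfusion?":
         "RhD type is generally not considered for plasma transfusion..."}
for prompt, sim in drift_report(day1, day30):
    print(f"{sim:.2f}  {prompt}")
```

Such an automated score could only flag candidates for review; judging whether a changed answer remains correct, complete and safe still requires the expert evaluation the study describes.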
At the time this study was conducted, none of the AI chatbots provided fully reliable, complete or safe responses to all TM-related prompts. However, ChatGPT 4-o and Anthropic Claude 3.5 Sonnet showed the greatest promise for future integration into TM practice. Given their variability and ongoing development, AI chatbots should not yet be relied upon as authoritative sources in TM without expert validation.