Johnson Ashish J, Singh Tarun Kumar, Gupta Aakash, Sankar Hariram, Gill Ikroop, Shalini Madhav, Mohan Neeraj
All India Institute of Medical Sciences (AIIMS), Bathinda, India.
Maulana Azad Institute of Dental Science, New Delhi, India.
Dent Traumatol. 2025 Apr;41(2):187-193. doi: 10.1111/edt.13000. Epub 2024 Oct 17.
This study aimed to assess the validity and reliability of four AI chatbots (Bing, ChatGPT 3.5, Google Gemini, and Claude AI) in answering frequently asked questions (FAQs) related to dental trauma.
A set of 30 FAQs was initially compiled from the responses of the four AI chatbots. A panel of expert endodontists and maxillofacial surgeons then refined this set to a final selection of 20 questions. Each question was entered into each chatbot three times, yielding 240 responses in total (20 questions × 4 chatbots × 3 repetitions). The responses were evaluated with the Global Quality Score (GQS) on a 5-point Likert scale (5: strongly agree; 4: agree; 3: neutral; 2: disagree; 1: strongly disagree), and any scoring disagreements were resolved through evidence-based discussion. Each question's responses were classified as valid or invalid at two thresholds: a low threshold (all three responses scored ≥ 4) and a high threshold (all three responses scored 5). A chi-squared test was used to compare the validity of the responses across chatbots, and Cronbach's alpha was calculated to assess reliability by measuring the consistency of each chatbot's repeated responses.
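A minimal sketch of how this analysis could be reproduced, assuming hypothetical GQS matrices (20 questions × 3 repetitions per chatbot); the abstract does not report the raw scores, so the data below are placeholders, and the helper functions are illustrative names, not the authors' code.

```python
# Illustrative sketch (not the authors' code): validity thresholding,
# chi-squared comparison, and Cronbach's alpha on hypothetical GQS data.
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical GQS scores: 20 questions x 3 repetitions per chatbot.
rng = np.random.default_rng(0)
scores = {
    "Claude AI": rng.integers(4, 6, size=(20, 3)),     # placeholder values
    "ChatGPT 3.5": rng.integers(3, 6, size=(20, 3)),
    "Google Gemini": rng.integers(3, 6, size=(20, 3)),
    "Bing": rng.integers(2, 6, size=(20, 3)),
}

def validity_counts(mat, threshold):
    """Count questions whose three repeated responses all meet the threshold."""
    valid = int((mat >= threshold).all(axis=1).sum())
    return valid, mat.shape[0] - valid

def cronbach_alpha(mat):
    """Cronbach's alpha, treating the three repetitions as parallel items."""
    k = mat.shape[1]
    item_vars = mat.var(axis=0, ddof=1).sum()      # per-repetition variances
    total_var = mat.sum(axis=1).var(ddof=1)        # variance of total scores
    return k / (k - 1) * (1 - item_vars / total_var)

# Low threshold: all three responses scored >= 4 (use 5 for the high threshold).
table_low = [validity_counts(m, 4) for m in scores.values()]
chi2, p, _, _ = chi2_contingency(table_low)
print(f"chi-squared = {chi2:.2f}, p = {p:.3f}")

for name, mat in scores.items():
    print(f"{name}: Cronbach's alpha = {cronbach_alpha(mat):.2f}")
```

This treats each chatbot's three repeated answers to a question as parallel "items" when computing alpha; the published analysis may have structured the reliability computation differently.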
The results indicated that Claude AI demonstrated superior validity and reliability compared with ChatGPT 3.5 and Google Gemini, whereas Bing was found to be less reliable. These findings underscore the need for authorities to establish strict guidelines to ensure the accuracy of medical information provided by AI chatbots.