Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA.
Department of Oncology, Johns Hopkins University, Baltimore, Maryland, USA.
J Immunother Cancer. 2024 May 30;12(5):e008599. doi: 10.1136/jitc-2023-008599.
BACKGROUND: Artificial intelligence (AI) chatbots have become a major source of general and medical information, though their accuracy and completeness are still being assessed. Their utility for answering questions about immune-related adverse events (irAEs), common and potentially dangerous toxicities of cancer immunotherapy, is not well defined.

METHODS: We developed 50 distinct questions, with answers available in published guidelines, spanning 10 irAE categories, plus an additional 20 patient-specific scenarios, and queried two AI chatbots (ChatGPT and Bard). Experts in irAE management scored answers for accuracy and completeness on a Likert scale ranging from 1 (least accurate/complete) to 4 (most accurate/complete). Answers were compared across categories and across engines.

RESULTS: Overall, both engines scored highly for accuracy (mean scores 3.87 for ChatGPT vs 3.5 for Bard, p<0.01) and completeness (3.83 vs 3.46, p<0.01). Scores of 1-2 (completely or mostly inaccurate or incomplete) were particularly rare for ChatGPT (6/800 answer-ratings, 0.75%). Of the 50 questions, all eight physician raters gave ChatGPT a rating of 4 (fully accurate or complete) on 22 questions for accuracy and on 16 questions for completeness. In the 20 patient scenarios, the mean accuracy score was 3.725 (median 4) and the mean completeness score was 3.61 (median 4).

CONCLUSIONS: AI chatbots provided largely accurate and complete information regarding irAEs, and wildly inaccurate information ("hallucinations") was uncommon. However, until accuracy and completeness increase further, established guidelines remain the gold standard to follow.