Khabaz Kameel, Newman-Hung Nicole J, Kallini Jennifer R, Kendal Joseph, Christ Alexander B, Bernthal Nicholas M, Wessel Lauren E
David Geffen School of Medicine at UCLA, Los Angeles, California, USA.
Department of Orthopaedic Surgery, University of California, Los Angeles, California, USA.
J Surg Oncol. 2025 Mar;131(4):719-724. doi: 10.1002/jso.27966. Epub 2024 Oct 29.
BACKGROUND AND OBJECTIVES: The potential impact of artificial intelligence (AI) chatbots on care for patients with bone sarcoma is poorly understood. Elucidating the potential risks and benefits would allow surgeons to define appropriate roles for these tools in clinical care.

METHODS: Eleven questions on bone sarcoma diagnosis, treatment, and recovery were posed to three AI chatbots. Answers were assessed on a 5-point Likert scale across five clinical accuracy metrics: relevance to the question, balance and lack of bias, grounding in established data, factual accuracy, and completeness of scope. Responses were also quantitatively assessed for empathy and readability, and understandability and actionability were scored with the Patient Education Materials Assessment Tool (PEMAT).

RESULTS: Chatbots scored highly on relevance (4.24) and balance/lack of bias (4.09) but lower on grounding in established data (3.77), completeness (3.68), and factual accuracy (3.66). Responses generally scored well on understandability (84.30%), while actionability scores were low for questions on treatment (64.58%) and recovery (60.64%). GPT-4 exhibited the highest empathy (4.12). Mean readability scores ranged from 10.28 for diagnosis questions to 11.65 for recovery questions.

CONCLUSIONS: While AI chatbots are promising tools, current limitations in factual accuracy and completeness, as well as concerns about inaccessibility for populations with lower health literacy, may significantly limit their clinical utility.
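The abstract reports grade-level readability scores (10.28 to 11.65) but does not name the formula used; the Flesch-Kincaid grade level is a common choice in studies of patient education materials. Below is a minimal sketch, assuming that metric: the sample response text and the simple syllable-counting heuristic are illustrative assumptions, not material from the study.

import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count runs of vowels, discounting a silent trailing "e".
    word = word.lower()
    groups = re.findall(r"[aeiouy]+", word)
    n = len(groups)
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_kincaid_grade(text: str) -> float:
    # FK grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)

# Hypothetical chatbot response, for illustration only.
response = ("Osteosarcoma is typically treated with a combination of "
            "chemotherapy and surgical resection of the tumor.")
print(f"Estimated grade level: {flesch_kincaid_grade(response):.2f}")

A score near 11, as produced here, corresponds to roughly an eleventh-grade reading level, which is consistent with the paper's concern that such responses exceed the comprehension of readers with lower health literacy.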