Evaluating the Performance of State-of-the-Art Artificial Intelligence Chatbots Based on the WHO Global Guidelines for the Prevention of Surgical Site Infection: Cross-Sectional Study.

Author information

Wang Tianyi, Chen Ruiyuan, Wang Baodong, Zou Congying, Fan Ning, Yuan Shuo, Wang Aobo, Xi Yu, Zang Lei

Affiliation

Beijing Chao-Yang Hospital, 5 JingYuan Road, Shijingshan District, Beijing, 100043, China, 86 51718268.

Publication information

J Med Internet Res. 2025 Jul 31;27:e75567. doi: 10.2196/75567.

Abstract

BACKGROUND

Surgical site infection (SSI) is the most prevalent type of health care-associated infection and leads to increased morbidity and mortality and a significant economic burden. Effective prevention of SSI relies on surgeons strictly following the latest clinical guidelines and implementing standardized, multilevel intervention strategies. However, because clinical guidelines are updated frequently, acquiring and interpreting them is time-consuming and complex. The emergence of artificial intelligence (AI) chatbots offers both opportunities and challenges for addressing these issues in the surgical field.

OBJECTIVE

This study aimed to test the multidimensional capability of state-of-the-art AI chatbots to generate appropriate recommendations and corresponding rationales concordant with the global guidelines for the prevention of SSI.

METHODS

With reference to other authoritative guidelines, recommendations and corresponding rationales from the 2018 World Health Organization global guidelines were refined and selected as benchmarks. They were then rephrased into a combined format of closed-ended queries for recommendations and open-ended queries for the corresponding rationales, and each query was entered 3 times into ChatGPT-4o (OpenAI), OpenAI-o1 (OpenAI), Claude 3.5 Sonnet (Anthropic), and Gemini 1.5 Pro (Google). All responses were individually evaluated on 10 metrics based on the QUEST dimensions by 4 multidisciplinary senior surgeons using a 5-point Likert scale. Multidimensional performance was compared among the chatbots, and interrater agreement was calculated.
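As a rough illustration of the design described above, the following Python sketch enumerates the chatbot, query, and repetition combinations and summarizes one response's ratings as a mean and SD. It is a minimal sketch under assumptions, not the authors' analysis code: the function names are hypothetical, the example ratings are invented, and the use of the sample SD is an assumption because the abstract does not say which SD was reported.

```python
from itertools import product
from statistics import mean, stdev

# Design parameters taken from the abstract.
CHATBOTS = ["ChatGPT-4o", "OpenAI-o1", "Claude 3.5 Sonnet", "Gemini 1.5 Pro"]
N_QUERIES = 25   # benchmark queries derived from the guideline
N_REPEATS = 3    # each query was entered 3 times per chatbot


def enumerate_responses():
    """Return one (chatbot, query_id, repeat) tuple per generated response."""
    return list(product(CHATBOTS, range(1, N_QUERIES + 1), range(1, N_REPEATS + 1)))


def summarize(scores):
    """Return (mean, sample SD) of 5-point Likert ratings, rounded to 2 decimals.

    Assumption: sample SD (n - 1 denominator); the abstract does not specify.
    """
    if any(s not in (1, 2, 3, 4, 5) for s in scores):
        raise ValueError("Likert ratings must be integers from 1 to 5")
    return round(mean(scores), 2), round(stdev(scores), 2)


responses = enumerate_responses()
print(len(responses))  # 300, matching 4 chatbots x 25 queries x 3 repetitions

# Hypothetical ratings from 4 evaluators for a single response on one metric.
example_ratings = [5, 4, 4, 3]
print(summarize(example_ratings))
```

Enumerating the combinations explicitly makes it easy to check that the design yields the 300 responses reported in the results.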

RESULTS

A total of 300 responses to 25 queries were generated by the 4 chatbots. Interrater agreement among the evaluators ranged from moderate to good (0.54-0.87). For responses to the recommendation queries, the average accuracy, consistency, and harm scores across all chatbots were 4.03 (SD 1.09), 4.07 (SD 0.88), and 4.29 (SD 1.01), respectively. For responses to the rationale queries, 4 subdimensions scored ≥4 on average: harm (mean 4.22, SD 0.97), relevance (mean 4.15, SD 0.83), fabrication and falsification (mean 4.12, SD 1.02), and understanding and reasoning (mean 4.04, SD 0.92). In contrast, consistency (mean 3.94, SD 0.72), clarity (mean 3.94, SD 0.89), comprehensiveness (mean 3.85, SD 0.83), and accuracy (mean 3.74, SD 0.91) scored at a moderate level. For the responses as a whole, the average self-awareness and trust and confidence scores across all chatbots were 3.84 (SD 0.89) and 3.88 (SD 0.91), respectively. Based on the average subdimension scores, Claude 3.5 Sonnet and ChatGPT-4o were the 2 top-performing models.
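The abstract does not name the statistic behind the interrater agreement range. The Python sketch below assumes an intraclass correlation coefficient (ICC) and shows one common variant, ICC(2,1) (two-way random effects, absolute agreement, single rater), together with a widely used interpretation convention (below 0.5 poor, 0.5 to 0.75 moderate, 0.75 to 0.9 good, above 0.9 excellent). The choice of statistic, the function names, and the example ratings are all assumptions for illustration only.

```python
def icc_2_1(ratings):
    """ICC(2,1) from a table with one row per rated response and one column per rater.

    Note: raises ZeroDivisionError if every rating in the table is identical.
    """
    n = len(ratings)      # number of rated responses
    k = len(ratings[0])   # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a rectangular table with >= 2 rows and >= 2 raters")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def agreement_label(icc):
    """Map an ICC value to a commonly used verbal label (assumed convention)."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.9:
        return "good"
    return "excellent"


# Hypothetical Likert ratings: 5 responses scored by 4 raters.
example = [
    [5, 4, 5, 4],
    [3, 3, 4, 3],
    [4, 4, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
]
value = icc_2_1(example)
print(round(value, 2), agreement_label(value))
```

Under this assumed convention, the reported endpoints of 0.54 and 0.87 map to "moderate" and "good", which is consistent with the wording in the results.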

CONCLUSIONS

The performance of AI chatbots in providing responses based on well-established global guidelines for the prevention of SSI was acceptable, demonstrating considerable potential for clinical application. Nonetheless, the stability of chatbots needs to be improved, as inaccurate responses can have severe consequences in the context of SSI. Despite these limitations, AI is expected to bring far-reaching changes in how clinicians access and use medical information.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e22d/12313333/a4ae45999a76/jmir-v27-e75567-g001.jpg
