Clinical Management of Wasp Stings Using Large Language Models: Cross-Sectional Evaluation Study

Author information

Pan Wei, Zhang Shuman, Wang Yonghong, Quan Zhenglin, Zhu Yanxia, Fang Zhicheng, Yang Xianyi

Affiliations

Department of Emergency Medicine, Taihe Hospital, Hubei University of Medicine, Shiyan, Hubei, China.

The Intensive Care Unit, The First Dongguan Affiliated Hospital, Guangdong Medical University, Dongguan, Guangdong, China.

Publication information

J Med Internet Res. 2025 Jun 4;27:e67489. doi: 10.2196/67489.

DOI:10.2196/67489

PMID:40466102

Original link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12177424/

Abstract

BACKGROUND

Wasp stings are a significant public health concern in many parts of the world, particularly in tropical and subtropical regions. Wasp venom contains a variety of bioactive compounds that can cause a wide range of clinical effects, from mild localized pain and swelling to severe, life-threatening allergic reactions such as anaphylaxis. With the rapid development of artificial intelligence (AI) technologies, large language models (LLMs) are increasingly being used in health care, including emergency medicine and toxicology. These models have the potential to help health care professionals make fast and informed clinical decisions. This study aimed to assess the performance of 4 leading LLMs, namely ERNIE Bot 3.5 (Baidu), ERNIE Bot 4.0 (Baidu), Claude Pro (Anthropic), and ChatGPT 4.0, in managing wasp sting cases, with a focus on their accuracy, comprehensiveness, and decision-making abilities.

OBJECTIVE

The objective of this research was to systematically evaluate and compare the capabilities of the 4 LLMs in the context of wasp sting management. This involved analyzing their responses to a series of standardized questions and real-world clinical scenarios. The study aimed to determine which LLMs provided the most accurate, complete, and clinically relevant information for the management of wasp stings.

METHODS

This cross-sectional study used 50 standardized questions covering 10 key domains in the management of wasp stings, along with 20 real-world clinical case scenarios. Responses from the 4 LLMs were independently evaluated by 8 domain experts, who rated them on a 5-point Likert scale for accuracy, completeness, and usefulness in clinical decision-making. Statistical comparisons between the models were made using the Wilcoxon signed-rank test, and the consistency of expert ratings was assessed using the Kendall coefficient of concordance.
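The two statistics named above can be illustrated with a minimal Python sketch. The abstract does not say which software the authors used, so this is not their implementation. The function names and the example ratings are hypothetical. The Wilcoxon P value here uses the normal approximation with a tie correction and no continuity correction, which is an assumption made for brevity.

```python
from math import erfc, sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over every value tied with the one at `start`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def tie_term(ranks):
    """Sum of t^3 - t over each group of t tied ranks."""
    return sum(ranks.count(r) ** 3 - ranks.count(r) for r in set(ranks))


def kendalls_w(ratings):
    """Kendall's W for ratings[rater][item], with the standard tie correction."""
    m, n = len(ratings), len(ratings[0])
    rank_rows = [average_ranks(row) for row in ratings]
    rank_sums = [sum(row[i] for row in rank_rows) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    denom = m ** 2 * (n ** 3 - n) - m * sum(tie_term(row) for row in rank_rows)
    if denom == 0:
        raise ValueError("W is undefined when every rater ties all items")
    return 12 * s / denom


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test on paired scores.

    Returns (statistic, p), where statistic is min(W+, W-) and p comes from
    the normal approximation (tie-corrected, no continuity correction).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    statistic = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term(ranks) / 48
    z = (statistic - mean) / sqrt(var)
    return statistic, erfc(abs(z) / sqrt(2))


if __name__ == "__main__":
    # Hypothetical Likert ratings (1 to 5): 3 raters scoring 4 responses.
    demo_ratings = [[5, 4, 3, 4], [5, 4, 2, 3], [4, 4, 3, 5]]
    print("Kendall W:", kendalls_w(demo_ratings))
    # Hypothetical paired scores for two models on the same 8 questions.
    model_a = [5, 5, 4, 5, 4, 5, 5, 4]
    model_b = [4, 3, 4, 4, 3, 4, 3, 3]
    print("Wilcoxon:", wilcoxon_signed_rank(model_a, model_b))
```

Likert ratings produce many ties, which is why both functions use average ranks and a tie correction. For a small number of pairs, an exact P value from an established statistics package would be preferable to the normal approximation used in this sketch.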

RESULTS

Claude Pro achieved the highest average score of 4.7 (SD 0.603) out of 5, followed closely by ChatGPT 4.0 with a score of 4.5. ERNIE Bot 4.0 and ERNIE Bot 3.5 received average scores of 4 (SD 0.600) and 3.8, respectively. In analyzing the 20 complex clinical cases, Claude Pro significantly outperformed ERNIE Bot 3.5, particularly in areas such as managing complications and assessing the severity of reactions (P<.001). The expert ratings showed moderate agreement (Kendall W=0.67), indicating that the assessments were consistently reliable.

CONCLUSIONS

The results of this study suggest that Claude Pro and ChatGPT 4.0 are highly capable of providing accurate and comprehensive support for the clinical management of wasp stings, particularly in complex decision-making scenarios. These findings support the increasing role of AI in emergency and toxicological medicine and suggest that the choice of AI tool should be based on the specific needs of the clinical situation, ensuring that the most appropriate model is selected for different health care applications.
