

Evaluation of Generative Artificial Intelligence Models in Predicting Pediatric Emergency Severity Index Levels

Authors

Ho Brandon, Lu Meng, Wang Xuan, Butler Russell, Park Joshua, Ren Dennis

Affiliations

Department of Computer Science, Virginia Tech, Falls Church, VA.

University of California Davis School of Medicine, Sacramento, CA.

Publication

Pediatr Emerg Care. 2025 Apr 1;41(4):251-255. doi: 10.1097/PEC.0000000000003315. Epub 2025 Jan 7.

DOI: 10.1097/PEC.0000000000003315
PMID: 39761573
Abstract

OBJECTIVE

Evaluate the accuracy and reliability of various generative artificial intelligence (AI) models (ChatGPT-3.5, ChatGPT-4.0, T5, Llama-2, Mistral-Large, and Claude-3 Opus) in predicting Emergency Severity Index (ESI) levels for pediatric emergency department patients and assess the impact of medically oriented fine-tuning.

METHODS

Seventy pediatric clinical vignettes from the ESI Handbook version 4 were used as the gold standard. Each AI model predicted the ESI level for each vignette. Performance metrics, including sensitivity, specificity, and F1 score, were calculated. Reliability was assessed by repeating the tests and measuring the interrater reliability using Fleiss kappa. Paired t tests were used to compare the models before and after fine-tuning.
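The metrics named above can be illustrated with a short one-vs-rest sketch in Python. This is not the authors' code, and the vignette labels below are invented; it only shows how sensitivity, specificity, and F1 are derived for a single ESI level treated as the positive class.

```python
# A minimal sketch (not the authors' code): sensitivity, specificity,
# and F1 for one ESI level treated as the positive class (one-vs-rest).

def binary_metrics(gold, pred, positive):
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return sensitivity, specificity, f1

# Invented toy labels: gold-standard ESI levels vs. one model's predictions.
gold = [1, 2, 2, 3, 3, 3, 4, 5]
pred = [1, 2, 3, 3, 3, 2, 4, 5]
sens, spec, f1 = binary_metrics(gold, pred, positive=3)
```

In a multi-level setting like ESI (levels 1-5), such per-level scores are typically averaged across levels to yield the aggregate figures reported in the Results.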

RESULTS

Claude-3 Opus achieved the highest performance among the untrained models, with a sensitivity of 80.6% (95% confidence interval [CI]: 63.6-90.7), specificity of 91.3% (95% CI: 83.8-99), and an F1 score of 73.9% (95% CI: 58.9-90.7). After fine-tuning, the GPT-4.0 model showed statistically significant improvement, with a sensitivity of 77.1% (95% CI: 60.1-86.5), specificity of 92.5% (95% CI: 89.5-97.4), and an F1 score of 74.6% (95% CI: 63.9-83.8, P < 0.04). Reliability analysis revealed high agreement for Claude-3 Opus (Fleiss κ: 0.85), followed by Mistral-Large (Fleiss κ: 0.79) and trained GPT-4.0 (Fleiss κ: 0.67). Training improved the reliability of GPT models (P < 0.001).
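The Fleiss κ values reported above measure agreement across repeated runs of the same model. A minimal sketch of the statistic in Python follows; this is not the authors' implementation, and the run data are invented for illustration.

```python
# A minimal sketch (not the authors' code) of Fleiss' kappa for
# agreement across repeated model runs on the same vignettes.

def fleiss_kappa(label_rows, categories):
    """label_rows: one list of labels per vignette, all the same length
    (one label per repeated run); categories: the possible ESI levels."""
    n = len(label_rows[0])  # ratings (runs) per vignette
    N = len(label_rows)     # number of vignettes
    counts = [[row.count(c) for c in categories] for row in label_rows]
    # Marginal proportion of each category over all ratings.
    p_j = [sum(col) / (N * n) for col in zip(*counts)]
    # Observed per-vignette agreement.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N             # mean observed agreement
    P_e = sum(p * p for p in p_j)    # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)

# Toy example: one model's ESI predictions over 3 repeated runs on 4 vignettes.
runs = [[2, 2, 2], [3, 3, 4], [1, 1, 1], [4, 4, 4]]
kappa = fleiss_kappa(runs, categories=[1, 2, 3, 4, 5])
```

κ of 1 means the model returns identical ESI levels on every run; values near 0 mean agreement no better than chance, which is why the reported κ of 0.85 for Claude-3 Opus indicates high run-to-run stability.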

CONCLUSIONS

Generative AI models demonstrate promising accuracy in predicting pediatric ESI levels, with fine-tuning significantly enhancing their performance and reliability. These findings suggest that AI could serve as a valuable tool in pediatric triage.


Similar Articles

1. Evaluation of Generative Artificial Intelligence Models in Predicting Pediatric Emergency Severity Index Levels.
Pediatr Emerg Care. 2025 Apr 1;41(4):251-255. doi: 10.1097/PEC.0000000000003315. Epub 2025 Jan 7.
2. Evaluating LLM-based generative AI tools in emergency triage: A comparative study of ChatGPT Plus, Copilot Pro, and triage nurses.
Am J Emerg Med. 2025 Mar;89:174-181. doi: 10.1016/j.ajem.2024.12.024. Epub 2024 Dec 19.
3. Chat-GPT in triage: Still far from surpassing human expertise - An observational study.
Am J Emerg Med. 2025 Jun;92:165-171. doi: 10.1016/j.ajem.2025.03.028. Epub 2025 Mar 18.
4. Comparative analysis of ChatGPT, Gemini and emergency medicine specialist in ESI triage assessment.
Am J Emerg Med. 2024 Jul;81:146-150. doi: 10.1016/j.ajem.2024.05.001. Epub 2024 May 3.
5. Assessing the precision of artificial intelligence in ED triage decisions: Insights from a study with ChatGPT.
Am J Emerg Med. 2024 Apr;78:170-175. doi: 10.1016/j.ajem.2024.01.037. Epub 2024 Jan 24.
6. Triage Performance Across Large Language Models, ChatGPT, and Untrained Doctors in Emergency Medicine: Comparative Study.
J Med Internet Res. 2024 Jun 14;26:e53297. doi: 10.2196/53297.
7. Claude 3 Opus and ChatGPT With GPT-4 in Dermoscopic Image Analysis for Melanoma Diagnosis: Comparative Performance Analysis.
JMIR Med Inform. 2024 Aug 6;12:e59273. doi: 10.2196/59273.
8. Emergency department triaging using ChatGPT based on emergency severity index principles: a cross-sectional study.
Sci Rep. 2024 Sep 27;14(1):22106. doi: 10.1038/s41598-024-73229-7.
9. Large Language Models for Simplified Interventional Radiology Reports: A Comparative Analysis.
Acad Radiol. 2025 Feb;32(2):888-898. doi: 10.1016/j.acra.2024.09.041. Epub 2024 Sep 30.
10. Assessing the feasibility of ChatGPT-4o and Claude 3-Opus in thyroid nodule classification based on ultrasound images.
Endocrine. 2025 Mar;87(3):1041-1049. doi: 10.1007/s12020-024-04066-x. Epub 2024 Oct 11.

Cited By

1. Artificial Intelligence Outperforms Physicians in General Medical Knowledge, Except in the Paediatrics Domain: A Cross-Sectional Study.
Bioengineering (Basel). 2025 Jun 14;12(6):653. doi: 10.3390/bioengineering12060653.