

The Performance of ChatGPT-4 and Gemini Ultra 1.0 for Quality Assurance Review in Emergency Medical Services Chest Pain Calls.

Author Information

Brant-Zawadzki Graham, Klapthor Brent, Ryba Chris, Youngquist Drew C, Burton Brooke, Palatinus Helen, Youngquist Scott T

Affiliations

Department of Emergency Medicine, University of Utah, Salt Lake City, Utah.

Unified Fire Authority, Salt Lake City, Utah.

Publication Information

Prehosp Emerg Care. 2025;29(3):210-217. doi: 10.1080/10903127.2024.2376757. Epub 2024 Jul 22.

DOI: 10.1080/10903127.2024.2376757
PMID: 38976859
Abstract

OBJECTIVES

This study assesses the feasibility, inter-rater reliability, and accuracy of using OpenAI's ChatGPT-4 and Google's Gemini Ultra large language models (LLMs), for Emergency Medical Services (EMS) quality assurance. The implementation of these LLMs for EMS quality assurance has the potential to significantly reduce the workload on medical directors and quality assurance staff by automating aspects of the processing and review of patient care reports. This offers the potential for more efficient and accurate identification of areas requiring improvement, thereby potentially enhancing patient care outcomes.

METHODS

Two expert human reviewers, ChatGPT GPT-4, and Gemini Ultra assessed and rated 150 consecutively sampled and anonymized prehospital records from 2 large urban EMS agencies for adherence to 2020 National Association of State EMS metrics for cardiac care. We evaluated the accuracy of scoring, inter-rater reliability, and review efficiency. The inter-rater reliability for the dichotomous outcome of each EMS metric was measured using the kappa statistic.
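The kappa statistic used above corrects raw percent agreement for the agreement two raters would reach by chance. A minimal sketch of Cohen's kappa for one dichotomous EMS metric (the ratings below are hypothetical illustration, not data from the study):

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters giving binary (0/1) ratings."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of charts where the two raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal rate of scoring "1".
    p_a1 = sum(rater_a) / n
    p_b1 = sum(rater_b) / n
    p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    # Kappa: observed agreement beyond chance, scaled to [.., 1].
    return (p_o - p_e) / (1 - p_e)

# Hypothetical pass/fail ratings for 10 charts on one metric.
human = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
model = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
print(round(cohens_kappa(human, model), 3))  # 0.524
```

Here raw agreement is 80%, but kappa drops to about 0.52 once chance agreement is discounted, which is why the study reports kappa alongside percent agreement.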

RESULTS

Human reviewers showed high inter-rater reliability, with 91.2% agreement and a kappa coefficient of 0.782 (0.654-0.910). ChatGPT-4 achieved substantial agreement with human reviewers in EKG documentation and aspirin administration (76.2% agreement, kappa coefficient 0.401 (0.334-0.468)), but performance varied across other metrics. Gemini Ultra's evaluation was discontinued due to poor performance. No significant differences were observed in median review times: 01:28 min (IQR 1:12-1:51 min) per human chart review, 01:24 min (IQR 01:09-01:53 min) per ChatGPT-4 chart review (p = 0.46), and 01:50 min (IQR 01:10-03:34 min) per Gemini Ultra review (p = 0.06).

CONCLUSIONS

Large language models demonstrate potential in supporting quality assurance by effectively and objectively extracting data elements. However, their accuracy in interpreting non-standardized and time-sensitive details remains inferior to human evaluators. Our findings suggest that current LLMs may best offer supplemental support to the human review processes, but their current value remains limited. Enhancements in LLM training and integration are recommended for improved and more reliable performance in the quality assurance processes.

Similar Articles

1. The Performance of ChatGPT-4 and Gemini Ultra 1.0 for Quality Assurance Review in Emergency Medical Services Chest Pain Calls.
Prehosp Emerg Care. 2025;29(3):210-217. doi: 10.1080/10903127.2024.2376757. Epub 2024 Jul 22.
2. Triage Performance Across Large Language Models, ChatGPT, and Untrained Doctors in Emergency Medicine: Comparative Study.
J Med Internet Res. 2024 Jun 14;26:e53297. doi: 10.2196/53297.
3. ChatGPT vs. Gemini: Comparative accuracy and efficiency in Lung-RADS score assignment from radiology reports.
Clin Imaging. 2025 May;121:110455. doi: 10.1016/j.clinimag.2025.110455. Epub 2025 Mar 13.
4. Evidence-based potential of generative artificial intelligence large language models in orthodontics: a comparative study of ChatGPT, Google Bard, and Microsoft Bing.
Eur J Orthod. 2024 Apr 13. doi: 10.1093/ejo/cjae017.
5. Comparative Analysis of ChatGPT-4o and Gemini Advanced Performance on Diagnostic Radiology In-Training Exams.
Cureus. 2025 Mar 20;17(3):e80874. doi: 10.7759/cureus.80874. eCollection 2025 Mar.
6. Gemini AI vs. ChatGPT: A comprehensive examination alongside ophthalmology residents in medical knowledge.
Graefes Arch Clin Exp Ophthalmol. 2025 Feb;263(2):527-536. doi: 10.1007/s00417-024-06625-4. Epub 2024 Sep 15.
7. Do large language model chatbots perform better than established patient information resources in answering patient questions? A comparative study on melanoma.
Br J Dermatol. 2025 Jan 24;192(2):306-315. doi: 10.1093/bjd/ljae377.
8. Collaborative Enhancement of Consistency and Accuracy in US Diagnosis of Thyroid Nodules Using Large Language Models.
Radiology. 2024 Mar;310(3):e232255. doi: 10.1148/radiol.232255.
9. Reliability of large language models for advanced head and neck malignancies management: a comparison between ChatGPT 4 and Gemini Advanced.
Eur Arch Otorhinolaryngol. 2024 Sep;281(9):5001-5006. doi: 10.1007/s00405-024-08746-2. Epub 2024 May 25.
10. Benchmarking the performance of large language models in uveitis: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Anthropic Claude3.
Eye (Lond). 2025 Apr;39(6):1132-1137. doi: 10.1038/s41433-024-03545-9. Epub 2024 Dec 17.

Cited By

1. Enhancing Patient Comprehension of Glomerular Disease Treatments Using ChatGPT.
Healthcare (Basel). 2024 Dec 31;13(1):57. doi: 10.3390/healthcare13010057.