Brant-Zawadzki Graham, Klapthor Brent, Ryba Chris, Youngquist Drew C, Burton Brooke, Palatinus Helen, Youngquist Scott T
Department of Emergency Medicine, University of Utah, Salt Lake City, Utah.
Unified Fire Authority, Salt Lake City, Utah.
Prehosp Emerg Care. 2025;29(3):210-217. doi: 10.1080/10903127.2024.2376757. Epub 2024 Jul 22.
This study assesses the feasibility, inter-rater reliability, and accuracy of using OpenAI's ChatGPT-4 and Google's Gemini Ultra large language models (LLMs) for Emergency Medical Services (EMS) quality assurance. Implementing these LLMs for EMS quality assurance could significantly reduce the workload of medical directors and quality assurance staff by automating aspects of processing and reviewing patient care reports, offering more efficient and accurate identification of areas requiring improvement and thereby potentially enhancing patient care outcomes.
Two expert human reviewers, ChatGPT-4, and Gemini Ultra assessed and rated 150 consecutively sampled, anonymized prehospital records from two large urban EMS agencies for adherence to 2020 National Association of State EMS metrics for cardiac care. We evaluated scoring accuracy, inter-rater reliability, and review efficiency. Inter-rater reliability for the dichotomous outcome of each EMS metric was measured using the kappa statistic.
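The kappa statistic for dichotomous ratings, as used here, can be sketched in a few lines; the ratings below are illustrative toy data, not values from the study:

```python
# Minimal sketch of Cohen's kappa for two raters scoring a dichotomous
# (pass/fail) metric, the inter-rater reliability measure named above.

def cohen_kappa(rater_a, rater_b):
    """kappa = (p_o - p_e) / (1 - p_e) for two equal-length rating lists."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of charts both raters scored identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal rates per category.
    cats = set(rater_a) | set(rater_b)
    p_e = sum((rater_a.count(c) / n) * (rater_b.count(c) / n) for c in cats)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical pass(1)/fail(0) scores for eight charts:
human = [1, 1, 0, 1, 0, 1, 1, 0]
model = [1, 1, 0, 0, 0, 1, 1, 1]
print(round(cohen_kappa(human, model), 3))  # -> 0.467
```

Kappa corrects raw percent agreement for agreement expected by chance, which is why the study reports both figures side by side.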
Human reviewers showed high inter-rater reliability, with 91.2% agreement and a kappa coefficient of 0.782 (0.654-0.910). ChatGPT-4 achieved substantial agreement with human reviewers on EKG documentation and aspirin administration (76.2% agreement; kappa coefficient 0.401 (0.334-0.468)), but performance varied across the other metrics. Gemini Ultra's evaluation was discontinued due to poor performance. No significant differences were observed in median review times: 1:28 min (IQR 1:12-1:51 min) per human chart review, 1:24 min (IQR 1:09-1:53 min) per ChatGPT-4 chart review (p = 0.46), and 1:50 min (IQR 1:10-3:34 min) per Gemini Ultra review (p = 0.06).
Large language models demonstrate potential to support quality assurance by effectively and objectively extracting data elements. However, their accuracy in interpreting non-standardized and time-sensitive details remains inferior to that of human evaluators. Our findings suggest that current LLMs are best suited to supplementing human review processes, and that their standalone value remains limited. Enhancements in LLM training and integration are recommended for more reliable performance in quality assurance processes.