Wulcan Judit M, Jacques Kevin L, Lee Mary Ann, Kovacs Samantha L, Dausend Nicole, Prince Lauren E, Wulcan Jonatan, Marsilio Sina, Keller Stefan M
Department of Pathology, Microbiology and Immunology, School of Veterinary Medicine, University of California, Davis, Davis, CA, United States.
College of Veterinary Medicine and Biomedical Sciences, James L. Voss Veterinary Teaching Hospital, Colorado State University, Fort Collins, CO, United States.
Front Vet Sci. 2025 Jan 16;11:1490030. doi: 10.3389/fvets.2024.1490030. eCollection 2024.
Large language models (LLMs) can extract information from veterinary electronic health records (EHRs), but performance differences between models, the effect of hyperparameter settings, and the influence of text ambiguity have not been previously evaluated. This study addresses these gaps by comparing the performance of GPT-4 omni (GPT-4o) and GPT-3.5 Turbo under different conditions and by investigating the relationship between human interobserver agreement and LLM errors. The LLMs and five humans were tasked with identifying six clinical signs associated with feline chronic enteropathy in 250 EHRs from a veterinary referral hospital. When compared to the majority opinion of human respondents, GPT-4o demonstrated 96.9% sensitivity [interquartile range (IQR) 92.9-99.3%], 97.6% specificity (IQR 96.5-98.5%), 80.7% positive predictive value (IQR 70.8-84.6%), 99.5% negative predictive value (IQR 99.0-99.9%), 84.4% F1 score (IQR 77.3-90.4%), and 96.3% balanced accuracy (IQR 95.0-97.9%). The performance of GPT-4o was significantly better than that of its predecessor, GPT-3.5 Turbo, particularly with respect to sensitivity where GPT-3.5 Turbo only achieved 81.7% (IQR 78.9-84.8%). GPT-4o demonstrated greater reproducibility than human pairs, with an average Cohen's kappa of 0.98 (IQR 0.98-0.99) compared to 0.80 (IQR 0.78-0.81) with humans. Most GPT-4o errors occurred in instances where humans disagreed [35/43 errors (81.4%)], suggesting that these errors were more likely caused by ambiguity of the EHR than explicit model faults. Using GPT-4o to automate information extraction from veterinary EHRs is a viable alternative to manual extraction, but requires validation for the intended setting to ensure accuracy and reliability.
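The following is an illustrative sketch (not code from the study) of how the reported per-sign metrics could be computed when an LLM's binary extractions are scored against the human majority opinion, and how Cohen's kappa could quantify reproducibility between repeated runs; all labels and function names below are hypothetical.

from collections import Counter

def binary_metrics(reference, predicted):
    """Sensitivity, specificity, PPV, NPV, F1, and balanced accuracy for one
    clinical sign, treating the human majority opinion as the reference."""
    counts = Counter(zip(reference, predicted))
    tp, tn = counts[(1, 1)], counts[(0, 0)]
    fp, fn = counts[(0, 1)], counts[(1, 0)]
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    npv = tn / (tn + fn) if tn + fn else float("nan")
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else float("nan")
    balanced_accuracy = (sensitivity + specificity) / 2
    return dict(sensitivity=sensitivity, specificity=specificity, ppv=ppv,
                npv=npv, f1=f1, balanced_accuracy=balanced_accuracy)

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two binary raters (e.g. two LLM replicates or a human pair)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_a1, p_b1 = sum(rater_a) / n, sum(rater_b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (observed - expected) / (1 - expected)

# Hypothetical labels for one clinical sign across ten records (1 = sign present).
human_majority = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
llm_run_1      = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]
llm_run_2      = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

print(binary_metrics(human_majority, llm_run_1))
print(cohens_kappa(llm_run_1, llm_run_2))  # reproducibility across repeated extractions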