Alonso Sousa Santiago, Bukhari Syed Saad Ul Hassan, Steagall Paulo Vinicius, Bęczkowski Paweł M, Giuliano Antonio, Flay Kate J
Department of Veterinary Clinical Sciences, Jockey Club College of Veterinary Medicine and Life Sciences, City University of Hong Kong, Kowloon, Hong Kong SAR, China.
Centre for Animal Health and Welfare, City University of Hong Kong, Kowloon, Hong Kong SAR, China.
Front Vet Sci. 2025 Aug 26;12:1616566. doi: 10.3389/fvets.2025.1616566. eCollection 2025.
The integration of artificial intelligence, particularly large language models (LLMs), into veterinary education and practice presents promising opportunities, yet LLM performance in veterinary-specific contexts remains understudied. This study comparatively evaluated nine advanced LLMs (ChatGPT o1Pro, ChatGPT 4o, ChatGPT 4.5, Grok 3, Gemini 2, Copilot, DeepSeek R1, Qwen 2.5 Max, and Kimi 1.5) on 250 multiple-choice questions (MCQs) sourced from a veterinary undergraduate final qualifying examination. Questions spanned multiple species, clinical topics, and clinical reasoning stages, and included both text-based and image-based formats. ChatGPT o1Pro and ChatGPT 4.5 achieved the highest overall performance, with correct response rates of 90.4% and 90.8%, respectively, demonstrating strong agreement with the gold standard across most categories, while Kimi 1.5 showed the lowest performance at 64.8%. Performance declined consistently as question difficulty increased and was generally lower for image-based than for text-based questions, although the OpenAI models showed stronger visual interpretation than models evaluated in previous studies. Disparities in performance across specific clinical reasoning stages and veterinary subdomains highlighted areas for targeted improvement. This study underscores the promising role of LLMs as supportive tools for quality assurance in veterinary assessment design and identifies key factors influencing their performance, including question difficulty, question format, and domain-specific training data.
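As a point of clarification for the "correct response rate" metric reported above, the following is a minimal sketch (not from the paper) of how per-model accuracy against a gold-standard answer key can be computed. The model names match the abstract; the answer records below are illustrative placeholders only, not study data.

```python
from collections import defaultdict

# Hypothetical records: (model, question_id, model_answer, gold_answer).
# In the actual study, each model answered the same 250 MCQs and responses
# were scored against the examination's gold-standard key.
records = [
    ("ChatGPT o1Pro", "Q001", "B", "B"),
    ("ChatGPT o1Pro", "Q002", "D", "C"),
    ("Kimi 1.5", "Q001", "A", "B"),
    ("Kimi 1.5", "Q002", "C", "C"),
]

correct = defaultdict(int)
total = defaultdict(int)
for model, _qid, answer, gold in records:
    total[model] += 1
    correct[model] += int(answer == gold)

# Correct response rate per model, as a percentage of questions attempted.
for model in sorted(total):
    rate = 100.0 * correct[model] / total[model]
    print(f"{model}: {rate:.1f}% correct over {total[model]} questions")
```

The same tallies, stratified by question difficulty, format (text vs. image), species, or clinical reasoning stage, would yield the category-level comparisons described in the abstract.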