
Performance of large language models on veterinary undergraduate multiple-choice examinations: a comparative evaluation.

Authors

Alonso Sousa Santiago, Bukhari Syed Saad Ul Hassan, Steagall Paulo Vinicius, Bęczkowski Paweł M, Giuliano Antonio, Flay Kate J

Affiliations

Department of Veterinary Clinical Sciences, Jockey Club College of Veterinary Medicine and Life Sciences, City University of Hong Kong, Kowloon, Hong Kong SAR, China.

Centre for Animal Health and Welfare, City University of Hong Kong, Kowloon, Hong Kong SAR, China.

Publication

Front Vet Sci. 2025 Aug 26;12:1616566. doi: 10.3389/fvets.2025.1616566. eCollection 2025.

DOI: 10.3389/fvets.2025.1616566
PMID: 40933532
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12418517/
Abstract

The integration of artificial intelligence, particularly large language models (LLMs), into veterinary education and practice presents promising opportunities, yet the performance of LLMs in veterinary-specific contexts remains understudied. This research comparatively evaluated the performance of nine advanced LLMs (ChatGPT o1Pro, ChatGPT 4o, ChatGPT 4.5, Grok 3, Gemini 2, Copilot, DeepSeek R1, Qwen 2.5 Max, and Kimi 1.5) on 250 multiple-choice questions (MCQs) sourced from a veterinary undergraduate final qualifying examination. Questions spanned various species, clinical topics, and reasoning stages, and included both text-based and image-based formats. ChatGPT o1Pro and ChatGPT 4.5 achieved the highest overall performance, with correct response rates of 90.4% and 90.8%, respectively, demonstrating strong agreement with the gold standard across most categories, while Kimi 1.5 showed the lowest performance at 64.8%. Performance consistently declined with increased question difficulty and was generally lower for image-based than for text-based questions. Compared with previous studies, OpenAI models excelled in visual interpretation. Disparities in performance were observed across specific clinical reasoning stages and veterinary subdomains, highlighting areas for targeted improvement. This study underscores the promising role of LLMs as supportive tools for quality assurance in veterinary assessment design and indicates key factors influencing their performance, including question difficulty, format, and domain-specific training data.
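As a rough illustration of the arithmetic behind the reported correct response rates, the sketch below converts each percentage back into a whole number of correct answers. It assumes every rate was computed over all 250 questions, which the abstract does not state explicitly; the function and variable names are hypothetical and not taken from the study.

```python
from fractions import Fraction

TOTAL_QUESTIONS = 250  # number of MCQs stated in the abstract

# Correct response rates (percent) as reported in the abstract.
reported_rates = {
    "ChatGPT o1Pro": "90.4",
    "ChatGPT 4.5": "90.8",
    "Kimi 1.5": "64.8",
}


def implied_correct_count(rate_percent: str, total: int) -> int:
    """Return the number of correct answers implied by a percentage.

    Assumption: the rate was computed over all `total` questions.
    Exact rational arithmetic avoids floating-point rounding issues.
    """
    count = Fraction(rate_percent) * total / 100
    if count.denominator != 1:
        raise ValueError(
            f"{rate_percent}% of {total} is not a whole number of questions"
        )
    return int(count)


implied_counts = {
    model: implied_correct_count(rate, TOTAL_QUESTIONS)
    for model, rate in reported_rates.items()
}

for model, count in implied_counts.items():
    print(f"{model}: {count}/{TOTAL_QUESTIONS} correct")
```

Under that assumption, the gap between the two top-scoring models corresponds to a single question out of 250.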


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a0bc/12418517/7c10bffcd454/fvets-12-1616566-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a0bc/12418517/8ab2e7ba4899/fvets-12-1616566-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a0bc/12418517/e262645f7aa8/fvets-12-1616566-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a0bc/12418517/3d365b57233f/fvets-12-1616566-g003.jpg

