Classification performance and reproducibility of GPT-4 omni for information extraction from veterinary electronic health records.

Authors

Wulcan Judit M, Jacques Kevin L, Lee Mary Ann, Kovacs Samantha L, Dausend Nicole, Prince Lauren E, Wulcan Jonatan, Marsilio Sina, Keller Stefan M

Affiliations

Department of Pathology, Microbiology and Immunology, School of Veterinary Medicine, University of California, Davis, Davis, CA, United States.

College of Veterinary Medicine and Biomedical Sciences, James L. Voss Veterinary Teaching Hospital, Colorado State University, Fort Collins, CO, United States.

Publication information

Front Vet Sci. 2025 Jan 16;11:1490030. doi: 10.3389/fvets.2024.1490030. eCollection 2024.

DOI: 10.3389/fvets.2024.1490030
PMID: 39885843
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11780673/

Abstract

Large language models (LLMs) can extract information from veterinary electronic health records (EHRs), but performance differences between models, the effect of hyperparameter settings, and the influence of text ambiguity have not been previously evaluated. This study addresses these gaps by comparing the performance of GPT-4 omni (GPT-4o) and GPT-3.5 Turbo under different conditions and by investigating the relationship between human interobserver agreement and LLM errors. The LLMs and five humans were tasked with identifying six clinical signs associated with feline chronic enteropathy in 250 EHRs from a veterinary referral hospital. When compared to the majority opinion of human respondents, GPT-4o demonstrated 96.9% sensitivity [interquartile range (IQR) 92.9-99.3%], 97.6% specificity (IQR 96.5-98.5%), 80.7% positive predictive value (IQR 70.8-84.6%), 99.5% negative predictive value (IQR 99.0-99.9%), 84.4% F1 score (IQR 77.3-90.4%), and 96.3% balanced accuracy (IQR 95.0-97.9%). The performance of GPT-4o was significantly better than that of its predecessor, GPT-3.5 Turbo, particularly with respect to sensitivity where GPT-3.5 Turbo only achieved 81.7% (IQR 78.9-84.8%). GPT-4o demonstrated greater reproducibility than human pairs, with an average Cohen's kappa of 0.98 (IQR 0.98-0.99) compared to 0.80 (IQR 0.78-0.81) with humans. Most GPT-4o errors occurred in instances where humans disagreed [35/43 errors (81.4%)], suggesting that these errors were more likely caused by ambiguity of the EHR than explicit model faults. Using GPT-4o to automate information extraction from veterinary EHRs is a viable alternative to manual extraction, but requires validation for the intended setting to ensure accuracy and reliability.
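
The metrics named in the abstract (sensitivity, specificity, positive and negative predictive value, F1 score, balanced accuracy, and Cohen's kappa) can be illustrated with a minimal Python sketch. The function names and the toy labels below are hypothetical and chosen only for illustration; they are not the authors' code or data, and the abstract does not state how results were aggregated across the six clinical signs. The sketch also assumes an odd number of human raters (so a majority always exists) and that every denominator is non-zero.

```python
def majority_vote(votes):
    """Reference label for one record: True if most raters marked the sign present."""
    return sum(bool(v) for v in votes) > len(votes) / 2


def binary_metrics(reference, predicted):
    """Compare predicted labels with reference labels for one binary clinical sign."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)          # true positives
    tn = sum(1 for r, p in pairs if not r and not p)  # true negatives
    fp = sum(1 for r, p in pairs if not r and p)      # false positives
    fn = sum(1 for r, p in pairs if r and not p)      # false negatives

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


def cohen_kappa(a, b):
    """Chance-corrected agreement between two binary label sequences."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if bool(x) == bool(y)) / n
    p_a = sum(bool(x) for x in a) / n  # share of positives from rater a
    p_b = sum(bool(y) for y in b) / n  # share of positives from rater b
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)


# Toy data (hypothetical): 10 records, one clinical sign.
reference = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # e.g. majority opinion of human raters
predicted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]  # e.g. model output

metrics = binary_metrics(reference, predicted)
kappa = cohen_kappa(reference, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(f"cohen_kappa: {kappa:.3f}")
```

With rare positive labels, a small number of false positives lowers the positive predictive value much more than it lowers specificity, which is consistent with the pattern of a high specificity alongside a lower positive predictive value reported above.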

