ChatGPT-4 Turbo and Meta's LLaMA 3.1: A Relative Analysis of Answering Radiology Text-Based Questions

Author Information

Abdul Sami Mohammed, Abdul Samad Mohammed, Parekh Keyur, Suthar Pokhraj P

Affiliations

Department of Diagnostic Radiology and Nuclear Medicine, Rush University Medical Center, Chicago, USA.

Department of Osteopathic Medicine, Des Moines University College of Osteopathic Medicine, West Des Moines, USA.

Publication Information

Cureus. 2024 Nov 24;16(11):e74359. doi: 10.7759/cureus.74359. eCollection 2024 Nov.

DOI:10.7759/cureus.74359
PMID:39720391
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11668536/
Abstract

AIMS AND OBJECTIVES

This study aimed to compare the accuracy of two AI models - OpenAI's GPT-4 Turbo (San Francisco, CA) and Meta's LLaMA 3.1 (Menlo Park, CA) - when answering a standardized set of pediatric radiology questions. The primary objective was to evaluate the overall accuracy of each model, while the secondary objective was to assess their performance within subsections.

METHODS AND MATERIALS

A total of 79 text-based pediatric radiology questions were selected out of 302 total questions for this comparison. The questions covered seven subsections, including musculoskeletal, chest, and neuroradiology, among others. Image-based questions were excluded to focus on text interpretation and to minimize the sampling bias within each model. Each model was tested independently on the same question set, and the percent accuracy was calculated for both overall performance as well as individual subsections.

RESULTS

GPT-4 Turbo performed at an overall accuracy of 88.6% (70/79 questions), outperforming LLaMA 3.1's 77.2% (61/79). Within subsections, GPT-4 Turbo had higher accuracy in most areas, except for equal accuracy in the neuroradiology section. The subsections with the greatest accuracy for GPT-4 Turbo, in descending order, were chest and cardiac radiology (100%), musculoskeletal system (93.3%), and genitourinary system (92.9%). LLaMA 3.1's highest performance was 86.7% in the musculoskeletal system, while its lowest was 50.0% in chest radiology.
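The accuracy metric used throughout is simple percent-correct, computed overall and within each subsection. The sketch below illustrates that calculation with hypothetical graded answers (the subsection labels and outcomes are illustrative, not the study's actual data); the final assertions confirm that the reported overall figures are consistent with the stated question counts.

```python
from collections import defaultdict

# Hypothetical graded results: (subsection, answered_correctly) per question.
# These labels and outcomes are illustrative only.
graded = [
    ("musculoskeletal", True), ("musculoskeletal", True), ("musculoskeletal", False),
    ("chest", True), ("chest", True),
    ("neuroradiology", True), ("neuroradiology", False),
]

def percent_accuracy(results):
    """Overall percent correct, rounded to one decimal place."""
    return round(100 * sum(ok for _, ok in results) / len(results), 1)

def subsection_accuracy(results):
    """Percent correct within each subsection."""
    buckets = defaultdict(list)
    for section, ok in results:
        buckets[section].append(ok)
    return {s: round(100 * sum(v) / len(v), 1) for s, v in buckets.items()}

print(percent_accuracy(graded))     # prints 71.4
print(subsection_accuracy(graded))

# The study's reported overall accuracies match the stated counts:
assert round(100 * 70 / 79, 1) == 88.6  # GPT-4 Turbo, 70/79
assert round(100 * 61 / 79, 1) == 77.2  # LLaMA 3.1, 61/79
```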

CONCLUSION

GPT-4 Turbo consistently outperformed LLaMA 3.1 in answering pediatric radiology questions, both overall and within most subsections. These findings suggest that GPT-4 Turbo may offer more accurate responses in specialized medical education, in contrast to LLaMA 3.1's efficient performance, although future research should further evaluate AI models' performance within other fields.

Figures (from PMC):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/21ca/11668536/83f4381491d6/cureus-0016-00000074359-i01.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/21ca/11668536/0196048accad/cureus-0016-00000074359-i02.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/21ca/11668536/b9f64d4b7e12/cureus-0016-00000074359-i03.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/21ca/11668536/a4928dcd282c/cureus-0016-00000074359-i04.jpg
