Abdul Sami Mohammed, Abdul Samad Mohammed, Parekh Keyur, Suthar Pokhraj P
Department of Diagnostic Radiology and Nuclear Medicine, Rush University Medical Center, Chicago, USA.
Department of Osteopathic Medicine, Des Moines University College of Osteopathic Medicine, West Des Moines, USA.
Cureus. 2024 Nov 24;16(11):e74359. doi: 10.7759/cureus.74359. eCollection 2024 Nov.
This study aimed to compare the accuracy of two AI models, OpenAI's GPT-4 Turbo (San Francisco, CA) and Meta's LLaMA 3.1 (Menlo Park, CA), in answering a standardized set of pediatric radiology questions. The primary objective was to evaluate the overall accuracy of each model; the secondary objective was to assess their performance within individual subsections.
A total of 79 text-based pediatric radiology questions were selected from a pool of 302 questions for this comparison. The questions covered seven subsections, including musculoskeletal, chest, and neuroradiology, among others. Image-based questions were excluded to focus the comparison on text interpretation and to minimize sampling bias between the models. Each model was tested independently on the same question set, and percent accuracy was calculated both for overall performance and for each individual subsection.
GPT-4 Turbo achieved an overall accuracy of 88.6% (70/79 questions), outperforming LLaMA 3.1's 77.2% (61/79). Within subsections, GPT-4 Turbo had higher accuracy in most areas, the exception being the neuroradiology section, where the two models scored equally. The subsections with the greatest accuracy for GPT-4 Turbo, in descending order, were chest and cardiac radiology (100%), musculoskeletal system (93.3%), and genitourinary system (92.9%). LLaMA 3.1's highest performance was 86.7% in the musculoskeletal system, while its lowest was 50.0% in chest radiology.
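As a quick sanity check, the reported overall percentages can be recomputed from the correct-answer counts given above (a minimal sketch; the helper name and structure are our own, not from the study):

```python
# Recompute the reported overall accuracies from correct/total counts.
# Counts come from the abstract; everything else here is illustrative.

def percent_accuracy(correct: int, total: int) -> float:
    """Percent accuracy, rounded to one decimal place."""
    return round(100 * correct / total, 1)

results = {
    "GPT-4 Turbo": (70, 79),  # 70/79 correct
    "LLaMA 3.1": (61, 79),    # 61/79 correct
}

for model, (correct, total) in results.items():
    print(f"{model}: {percent_accuracy(correct, total)}%")
# GPT-4 Turbo: 88.6%
# LLaMA 3.1: 77.2%
```

The rounding convention (one decimal place) matches the figures reported in the abstract.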
GPT-4 Turbo consistently outperformed LLaMA 3.1 in answering pediatric radiology questions, both overall and within most subsections. These findings suggest that GPT-4 Turbo may offer more accurate responses for specialized medical education than the more efficient LLaMA 3.1, although future research should further evaluate AI models' performance in other fields.