Xiong Yu-Tao, Lian Wen-Jun, Sun Ya-Nan, Liu Wei, Guo Ji-Xiang, Tang Wei, Liu Chang
State Key Laboratory of Oral Diseases & National Center for Stomatology & National Clinical Research Center for Oral Diseases & Department of Oral and Maxillofacial Surgery, West China Hospital of Stomatology, Sichuan University, Chengdu, 610041, China.
College of Computer Science, Sichuan University, Chengdu, 610065, China.
Clin Oral Investig. 2025 Aug 12;29(9):405. doi: 10.1007/s00784-025-06498-9.
The aim of this study was to evaluate GPT-4o's multimodal reasoning ability to review panoramic radiographs (PRs) and verify the corresponding radiologic findings, while exploring the role of prompt engineering in enhancing its performance.
The study included 230 PRs acquired at West China Hospital of Stomatology in 2024, which were interpreted to generate the PR findings. A total of 300 interpretation errors were manually inserted into the PR findings. An ablation study was conducted to assess whether GPT-4o can reason over PRs under a zero-shot prompt. Prompt engineering was then employed to enhance GPT-4o's ability to identify interpretation errors using the PRs. The prompt strategies included chain-of-thought, self-consistency, in-context learning, multimodal in-context learning, and their systematic integration into a meta prompt. Recall, accuracy, and F1 score were used to evaluate the outputs. Subsequently, the localization capability of GPT-4o and its influence on reasoning capability were evaluated.
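As a rough illustration of how recall and F1 score can be computed when the model's flagged errors are compared against the inserted ones, the following Python sketch may help. It is not the authors' evaluation code: the function name and the false-positive count are hypothetical, and the abstract does not define how accuracy was calculated, so accuracy is left out.

```python
def recall_precision_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (recall, precision, F1) from detection counts.

    tp: inserted errors the model flagged
    fp: statements the model flagged that were not inserted errors
    fn: inserted errors the model missed
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f1


# Assumption: 158 of the 300 inserted errors detected, which is consistent
# with a recall of 52.67%. The false-positive count (70) is purely hypothetical.
recall, precision, f1 = recall_precision_f1(tp=158, fp=70, fn=142)
print(f"recall={recall:.2%}, precision={precision:.2%}, F1={f1:.2f}")
```

Recall depends only on the 300 inserted errors, whereas F1 also penalizes false alarms, which is why the two metrics can move differently across prompt strategies.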
In the ablation study, GPT-4o's recall increased significantly from 2.67% to 43.33% once it was given the PRs (P < 0.001). With the meta prompt, GPT-4o showed improvements in recall (43.33% vs. 52.67%, P = 0.022), accuracy (39.95% vs. 68.75%, P < 0.001), and F1 score (0.42 vs. 0.60, P < 0.001) compared with the zero-shot prompt (values given as zero-shot vs. meta prompt) and with the other prompt strategies. The localization accuracy of GPT-4o was 45.67% (137 out of 300, 95% CI: 40.00 to 51.34). A significant correlation was observed between its localization accuracy and reasoning capability under the meta prompt (φ coefficient = 0.33, P < 0.001). Providing accurate localization cues within the meta prompt increased the model's recall by 5.49% (P = 0.031).
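The two statistics reported for localization can be sketched in Python as follows. The abstract does not state which interval method was used, so the normal-approximation (Wald) interval here is an assumption; it comes close to, but does not exactly reproduce, the reported 40.00 to 51.34. The φ coefficient function takes the four cells of a 2×2 table (for example, localization correct or incorrect against error detected or missed), and the example table is hypothetical because the abstract does not give those counts.

```python
import math


def wald_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval for a proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


def phi_coefficient(a: int, b: int, c: int, d: int) -> float:
    """Phi coefficient for the 2x2 table [[a, b], [c, d]]."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0


# Localization accuracy from the abstract: 137 correct out of 300.
low, high = wald_ci(137, 300)
print(f"accuracy={137 / 300:.2%}, 95% CI {low:.2%} to {high:.2%}")

# Hypothetical 2x2 table, for illustration only:
# rows = localization correct / incorrect, columns = error detected / missed.
print(f"phi={phi_coefficient(90, 47, 68, 95):.2f}")
```

A φ coefficient near 0.33 indicates a moderate association between the two binary outcomes, which fits the modest recall gain seen when accurate localization cues were supplied.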
GPT-4o demonstrated some multimodal capability for interpreting PRs, and prompt engineering improved its performance. Nevertheless, its performance remains inadequate for clinical requirements. Future work is needed to identify additional factors influencing the model's reasoning capability or to develop more advanced models.
This study evaluates GPT-4o's capability to interpret and reason through PRs and explores potential methods to enhance its performance before it is applied clinically to assist radiological assessments.