Cesur Turay, Güneş Yasin Celal
Radiology, Ankara Mamak State Hospital, Ankara, TUR.
Radiology, Kırıkkale Yuksek Ihtisas Hospital, Ankara, TUR.
Cureus. 2024 May 9;16(5):e60009. doi: 10.7759/cureus.60009. eCollection 2024 May.
Background: Recent studies have highlighted the diagnostic performance of ChatGPT 3.5 and GPT-4 in a text-based format, demonstrating their radiological knowledge across different areas. Our objective is to investigate the impact of prompt engineering on the diagnostic performance of ChatGPT 3.5 and GPT-4 in diagnosing thoracic radiology cases, highlighting how prompt complexity influences model performance.

Methodology: We conducted a retrospective cross-sectional study using 124 publicly available examples from the Thoracic Society of Radiology website. We initially input the cases into both ChatGPT versions without prompting. We then employed five different prompts, ranging from basic task-oriented to complex role-specific formulations, to measure the diagnostic accuracy of each version. The differential diagnosis lists generated by the models were compared against the radiological diagnoses listed on the Thoracic Society of Radiology website, with a scoring system used to assess accuracy comprehensively. Diagnostic accuracy and differential diagnosis scores were analyzed using the McNemar, Chi-square, Kruskal-Wallis, and Mann-Whitney U tests.

Results: Without any prompt, ChatGPT 3.5's accuracy was 25% (31/124), which increased to 56.5% (70/124) with the most complex prompt (P < 0.001). GPT-4 showed a high baseline accuracy of 53.2% (66/124) without prompting, which increased to 59.7% (74/124) with complex prompts (P = 0.09). Notably, there was no statistically significant difference in peak performance between ChatGPT 3.5 (70/124) and GPT-4 (74/124) (P = 0.55).

Conclusions: This study emphasizes the critical influence of prompt engineering on enhancing the diagnostic performance of ChatGPT versions, especially ChatGPT 3.5.
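To make the tiered-prompting protocol concrete, the following is a minimal sketch, assuming access through the OpenAI Chat Completions API (Python `openai` SDK). The abstract does not reproduce the study's five prompts, so the tiers below (`none`, `basic_task`, `role_specific`) and the helper `query_model` are hypothetical placeholders, not the authors' actual formulations.

```python
# Minimal sketch of a tiered-prompting protocol, assuming the OpenAI
# Chat Completions API. The prompt wordings are illustrative placeholders;
# the study's five prompts are not given in the abstract.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompt tiers, from no prompt to a complex role-specific one.
PROMPT_TIERS = {
    "none": "",
    "basic_task": "List the most likely differential diagnoses for this case.",
    "role_specific": (
        "You are a board-certified thoracic radiologist. Review the case "
        "findings below and provide a ranked differential diagnosis list "
        "with brief reasoning for each entry."
    ),
}

def query_model(model: str, tier: str, case_text: str) -> str:
    """Send one case to one model under one prompt tier; return the reply."""
    messages = []
    if PROMPT_TIERS[tier]:
        messages.append({"role": "system", "content": PROMPT_TIERS[tier]})
    messages.append({"role": "user", "content": case_text})
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content

# Example loop over tiers and models (the study used 124 cases):
# for tier in PROMPT_TIERS:
#     for model in ("gpt-3.5-turbo", "gpt-4"):
#         print(tier, model, query_model(model, tier, case_text))
```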
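The paired comparison behind the reported P < 0.001 for ChatGPT 3.5 can be sketched with McNemar's test. The test requires the per-case 2x2 agreement table, but the abstract reports only the marginal counts (31/124 correct without prompting, 70/124 with the most complex prompt), so the discordant cells below are hypothetical values chosen only to be consistent with those marginals.

```python
# Hedged sketch of the McNemar test for ChatGPT 3.5's accuracy without
# prompting vs. with the most complex prompt, on the same 124 cases.
# Cell values are hypothetical: rows = no-prompt outcome, columns =
# complex-prompt outcome, constrained to match the reported marginals.
from statsmodels.stats.contingency_tables import mcnemar

table = [
    [28,  3],   # no prompt correct:   28 also correct with prompt, 3 not
    [42, 51],   # no prompt incorrect: 42 rescued by the prompt, 51 still wrong
]
# Consistency check: 28 + 3 = 31 correct without prompting;
# 28 + 42 = 70 correct with the complex prompt; cells sum to 124.

result = mcnemar(table, exact=True)  # exact binomial test on the 3 vs. 42 discordant pairs
print(f"McNemar P-value: {result.pvalue:.3g}")  # far below 0.001 for this split
```

With any discordant split this lopsided, the exact test yields P well below 0.001, matching the reported significance; the GPT-4 baseline-vs-prompted comparison (66 vs. 74 correct, P = 0.09) would use the same machinery with a far more balanced discordant split.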