Wada Akihiko, Akashi Toshiaki, Shih George, Hagiwara Akifumi, Nishizawa Mitsuo, Hayakawa Yayoi, Kikuta Junko, Shimoji Keigo, Sano Katsuhiro, Kamagata Koji, Nakanishi Atsushi, Aoki Shigeki
Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan.
Clinical Radiology, Weill Cornell Medical College, New York, NY 10065, USA.
Diagnostics (Basel). 2024 Jul 17;14(14):1541. doi: 10.3390/diagnostics14141541.
Integrating large language models (LLMs) such as GPT-4 Turbo into diagnostic imaging faces a significant challenge: current misdiagnosis rates range from 30% to 50%. This study evaluates how prompt engineering and confidence thresholds can improve diagnostic accuracy in neuroradiology.
We analyzed 751 neuroradiology cases from the American Journal of Neuroradiology using GPT-4 Turbo with customized prompts designed to improve diagnostic precision.
At baseline, GPT-4 Turbo achieved a diagnostic accuracy of 55.1%. After responses were reformatted to list five diagnostic candidates and a 90% confidence threshold was applied, diagnostic precision peaked at 72.9% and the candidate list contained the correct diagnosis in 85.9% of cases, reducing the misdiagnosis rate to 14.1%. However, the threshold also reduced the number of cases for which the model gave a response.
Strategic prompt engineering and high confidence thresholds significantly reduce misdiagnoses and improve the precision of LLM-based diagnosis in neuroradiology. More research is needed to optimize these approaches for broader clinical implementation, balancing accuracy and utility.
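The trade-off reported in the results, higher precision at the cost of fewer answered cases, can be illustrated with a minimal Python sketch. It is not the authors' code. The `CaseResult` record, the `evaluate` function, the reading of precision as top-candidate precision, and the toy cases are all assumptions made for illustration. The misdiagnosis rate follows the abstract's arithmetic (100% minus the 85.9% list hit rate gives 14.1%), so it is computed as the share of answered cases whose candidate list misses the correct diagnosis.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Hypothetical record of one model response (not from the paper)."""
    candidates: list[str]  # ranked diagnostic candidates, up to five
    confidence: float      # model-reported confidence in the top candidate, 0 to 1
    truth: str             # reference diagnosis for the case


def evaluate(cases: list[CaseResult], threshold: float) -> dict[str, float]:
    """Score only the cases whose confidence meets the threshold."""
    answered = [c for c in cases if c.confidence >= threshold]
    if not cases or not answered:
        # Nothing cleared the threshold: no precision can be computed.
        return {"coverage": 0.0, "top1_precision": 0.0,
                "list_hit_rate": 0.0, "misdiagnosis_rate": 0.0}
    n = len(answered)
    # Top candidate matches the reference diagnosis.
    top1 = sum(c.candidates[:1] == [c.truth] for c in answered) / n
    # Reference diagnosis appears anywhere in the candidate list.
    in_list = sum(c.truth in c.candidates for c in answered) / n
    return {
        "coverage": n / len(cases),          # share of cases still answered
        "top1_precision": top1,
        "list_hit_rate": in_list,
        "misdiagnosis_rate": 1.0 - in_list,  # mirrors 100% - 85.9% = 14.1%
    }


# Toy data, invented for illustration only.
demo_cases = [
    CaseResult(["glioma", "lymphoma"], 0.95, "glioma"),
    CaseResult(["abscess", "metastasis"], 0.92, "metastasis"),
    CaseResult(["meningioma"], 0.60, "schwannoma"),
    CaseResult(["stroke", "tumor"], 0.91, "encephalitis"),
]

report = evaluate(demo_cases, threshold=0.90)
print(report)
```

Raising `threshold` in this sketch shrinks `coverage` and scores only the remaining high-confidence responses, which is the accuracy-versus-utility balance the conclusion refers to.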