Harigai Ayaka, Toyama Yoshitaka, Nagano Mitsutoshi, Abe Mirei, Kawabata Masahiro, Li Li, Yamamura Jin, Takase Kei
Department of Diagnostic Radiology, Tohoku University Hospital, 1-1 Seiryo-Machi, Aoba-Ku, Sendai, Miyagi, Japan.
Department of Diagnostic Radiology, Tohoku University Graduate School of Medicine, 1-1 Seiryo-Machi, Aoba-Ku, Sendai, Miyagi, Japan.
Jpn J Radiol. 2025 Feb;43(2):319-329. doi: 10.1007/s11604-024-01673-6. Epub 2024 Oct 28.
This study aims to investigate the effects of language selection and translation quality on Generative Pre-trained Transformer-4 (GPT-4)'s response accuracy to expert-level diagnostic radiology questions.
We analyzed 146 diagnostic radiology questions from the Japan Radiology Board Examination (2020-2022), with consensus answers provided by two board-certified radiologists. The questions, originally in Japanese, were translated into English by GPT-4 and DeepL and into German and Chinese by GPT-4. Responses were generated by GPT-4 five times per question set per language. Response accuracy was compared between languages using one-way ANOVA with Bonferroni correction or the Mann-Whitney U test. Scores on selected English questions translated by a professional service and GPT-4 were also compared. The impact of translation quality on GPT-4's performance was assessed by linear regression analysis.
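The statistical workflow described above (group comparison by one-way ANOVA with Bonferroni-corrected pairwise tests, and regression of translation quality on correct-response counts) can be sketched with SciPy. All numeric arrays below are hypothetical placeholders for illustration, not the study's data:

```python
# Hypothetical sketch of the abstract's statistical workflow using SciPy.
# The score and quality arrays are illustrative placeholders, NOT study data.
from scipy import stats

# Five GPT-4 attempts per language: overall score per attempt (hypothetical)
scores = {
    "Japanese": [70, 68, 72, 70, 69],
    "English":  [89, 84, 95, 90, 88],
    "Chinese":  [64, 55, 67, 60, 62],
    "German":   [56, 46, 67, 55, 50],
}

# One-way ANOVA across the four language groups
f_stat, p_anova = stats.f_oneway(*scores.values())

# Pairwise comparison (e.g., Japanese vs. English) with Bonferroni
# correction: multiply the raw p-value by the number of comparisons.
n_comparisons = 3  # Japanese vs. each translated language (assumed)
u_stat, p_raw = stats.mannwhitneyu(scores["Japanese"], scores["English"])
p_bonferroni = min(p_raw * n_comparisons, 1.0)

# Linear regression: translation-quality rating vs. count of correct
# responses across five attempts (both arrays hypothetical, per question)
quality = [3.0, 4.5, 2.5, 5.0, 4.0, 3.5]  # translation quality score
correct = [2, 4, 1, 5, 4, 3]              # correct responses out of 5
reg = stats.linregress(quality, correct)

print(f"ANOVA p={p_anova:.4f}, Bonferroni p={p_bonferroni:.4f}, "
      f"slope={reg.slope:.2f}")
```

The Bonferroni step simply scales each pairwise p-value by the number of comparisons performed, which matches the multiple-comparison control named in the methods; the exact set of pairwise contrasts in the study is not specified in the abstract, so `n_comparisons` here is an assumption.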
The median scores (interquartile range) for the 146 questions were 70 (68-72) (Japanese), 89 (84.5-95.5) (GPT-4 English), 64 (55.5-67) (Chinese), and 56 (46.5-67.5) (German). Significant differences were found between Japanese and English (p = 0.002) and between Japanese and German (p = 0.022). The counts of correct responses across five attempts for each question were significantly associated with the quality of translation into English (GPT-4, DeepL) and German (GPT-4). In a subset of 31 questions where English translations yielded fewer correct responses than Japanese originals, professionally translated questions yielded better scores than those translated by GPT-4 (13 versus 8 points, p = 0.0079).
GPT-4 exhibits higher accuracy when responding to English-translated questions than to the original Japanese questions, a trend not observed with German or Chinese translations. Accuracy improves with higher-quality English translations. These findings underscore the importance of high-quality translation in improving GPT-4's response accuracy on diagnostic radiology questions posed in non-English languages, and in helping non-native English speakers obtain accurate answers from large language models.