Response accuracy of GPT-4 across languages: insights from an expert-level diagnostic radiology examination in Japan.

Authors

Harigai Ayaka, Toyama Yoshitaka, Nagano Mitsutoshi, Abe Mirei, Kawabata Masahiro, Li Li, Yamamura Jin, Takase Kei

Affiliations

Department of Diagnostic Radiology, Tohoku University Hospital, 1-1 Seiryo-Machi, Aoba-Ku, Sendai, Miyagi, Japan.

Department of Diagnostic Radiology, Tohoku University Graduate School of Medicine, 1-1 Seiryo-Machi, Aoba-Ku, Sendai, Miyagi, Japan.

Publication information

Jpn J Radiol. 2025 Feb;43(2):319-329. doi: 10.1007/s11604-024-01673-6. Epub 2024 Oct 28.

Abstract

PURPOSE

This study aims to investigate the effects of language selection and translation quality on Generative Pre-trained Transformer-4 (GPT-4)'s response accuracy to expert-level diagnostic radiology questions.

MATERIALS AND METHODS

We analyzed 146 diagnostic radiology questions from the Japan Radiology Board Examination (2020-2022), with consensus answers provided by two board-certified radiologists. The questions, originally in Japanese, were translated into English by GPT-4 and DeepL and into German and Chinese by GPT-4. Responses were generated by GPT-4 five times per question set per language. Response accuracy was compared between languages using one-way ANOVA with Bonferroni correction or the Mann-Whitney U test. Scores on selected English questions translated by a professional service and GPT-4 were also compared. The impact of translation quality on GPT-4's performance was assessed by linear regression analysis.
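The scoring implied by this design can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the answer key, the response grid, and the function names `run_scores` and `per_question_counts` are hypothetical, and the data are small made-up values. It shows the two quantities the abstract relies on: a per-run score (correct answers in one pass over the question set) and a per-question count of correct responses across the five attempts.

```python
from typing import Dict, List

def run_scores(key: Dict[str, str], runs: List[Dict[str, str]]) -> List[int]:
    """Return one score per run: the number of questions answered correctly."""
    return [sum(1 for q, ans in key.items() if run.get(q) == ans) for run in runs]

def per_question_counts(key: Dict[str, str], runs: List[Dict[str, str]]) -> Dict[str, int]:
    """Return, for each question, how many runs answered it correctly (0..len(runs))."""
    return {q: sum(1 for run in runs if run.get(q) == ans) for q, ans in key.items()}

# Hypothetical toy data: 3 questions, 5 runs in one language.
answer_key = {"Q1": "a", "Q2": "c", "Q3": "b"}
responses = [
    {"Q1": "a", "Q2": "c", "Q3": "d"},
    {"Q1": "a", "Q2": "b", "Q3": "d"},
    {"Q1": "a", "Q2": "c", "Q3": "b"},
    {"Q1": "a", "Q2": "c", "Q3": "d"},
    {"Q1": "e", "Q2": "c", "Q3": "d"},
]

if __name__ == "__main__":
    print(run_scores(answer_key, responses))           # [2, 1, 3, 2, 1]
    print(per_question_counts(answer_key, responses))  # {'Q1': 4, 'Q2': 4, 'Q3': 1}
```

In this sketch the per-run scores would feed the between-language comparison, while the per-question counts would serve as the outcome in a regression against translation quality.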

RESULTS

The median scores (interquartile range) for the 146 questions were 70 (68-72) (Japanese), 89 (84.5-95.5) (GPT-4 English), 64 (55.5-67) (Chinese), and 56 (46.5-67.5) (German). Significant differences were found between Japanese and English (p = 0.002) and between Japanese and German (p = 0.022). The counts of correct responses across five attempts for each question were significantly associated with the quality of translation into English (GPT-4, DeepL) and German (GPT-4). In a subset of 31 questions where English translations yielded fewer correct responses than Japanese originals, professionally translated questions yielded better scores than those translated by GPT-4 (13 versus 8 points, p = 0.0079).
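As a rough illustration of how summaries like these are computed, the sketch below derives a median with interquartile range from five run scores and applies a Bonferroni adjustment to a raw p-value. The scores, the raw p-value, and the number of comparisons are hypothetical, and the quartile convention (Python's "exclusive" method) is an assumption; the abstract does not state which convention or software was used.

```python
import statistics
from typing import List, Tuple

def median_iqr(scores: List[float]) -> Tuple[float, float, float]:
    """Return (median, Q1, Q3) using the 'exclusive' quartile convention (an assumption)."""
    q1, med, q3 = statistics.quantiles(scores, n=4, method="exclusive")
    return med, q1, q3

def bonferroni(p_value: float, n_comparisons: int) -> float:
    """Bonferroni-adjusted p-value: multiply by the number of comparisons, cap at 1."""
    return min(1.0, p_value * n_comparisons)

# Hypothetical scores from five runs on one question set.
example_scores = [66, 68, 70, 72, 74]

if __name__ == "__main__":
    med, q1, q3 = median_iqr(example_scores)
    print(f"median {med} (IQR {q1}-{q3})")
    # Hypothetical raw p-value and comparison count.
    print(bonferroni(0.004, 3))
```

With only five runs per language, the quartiles depend noticeably on the convention chosen, so a different statistics package could report slightly different interquartile ranges for the same scores.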

CONCLUSION

GPT-4 exhibits higher accuracy when responding to English-translated questions compared to original Japanese questions, a trend not observed with German or Chinese translations. Accuracy improves with higher-quality English translations, underscoring the importance of high-quality translations in improving GPT-4's response accuracy to diagnostic radiology questions in non-English languages and aiding non-native English speakers in obtaining accurate answers from large language models.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/107d/11790683/26554fdbb2d4/11604_2024_1673_Fig1_HTML.jpg
