Harigai Ayaka, Toyama Yoshitaka, Nagano Mitsutoshi, Abe Mirei, Kawabata Masahiro, Li Li, Yamamura Jin, Takase Kei
Department of Diagnostic Radiology, Tohoku University Hospital, 1-1 Seiryo-Machi, Aoba-Ku, Sendai, Miyagi, Japan.
Department of Diagnostic Radiology, Tohoku University Graduate School of Medicine, 1-1 Seiryo-Machi, Aoba-Ku, Sendai, Miyagi, Japan.
Jpn J Radiol. 2025 Feb;43(2):319-329. doi: 10.1007/s11604-024-01673-6. Epub 2024 Oct 28.
This study aims to investigate the effects of language selection and translation quality on Generative Pre-trained Transformer-4 (GPT-4)'s response accuracy to expert-level diagnostic radiology questions.
We analyzed 146 diagnostic radiology questions from the Japan Radiology Board Examination (2020-2022), with consensus answers provided by two board-certified radiologists. The questions, originally in Japanese, were translated into English by GPT-4 and DeepL and into German and Chinese by GPT-4. Responses were generated by GPT-4 five times per question set per language. Response accuracy was compared between languages using one-way ANOVA with Bonferroni correction or the Mann-Whitney U test. Scores on selected English questions translated by a professional service and GPT-4 were also compared. The impact of translation quality on GPT-4's performance was assessed by linear regression analysis.
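The statistical workflow described above (group comparison by one-way ANOVA with Bonferroni-corrected pairwise tests, and regression of translation quality on correct-response counts) can be sketched with SciPy. All numeric arrays below are hypothetical placeholders for illustration, not the study's data:

```python
# Hypothetical sketch of the abstract's statistical workflow using SciPy.
# The score and quality arrays are illustrative placeholders, NOT study data.
from scipy import stats

# Five GPT-4 attempts per language: overall score per attempt (hypothetical)
scores = {
    "Japanese": [70, 68, 72, 70, 69],
    "English":  [89, 84, 95, 90, 88],
    "Chinese":  [64, 55, 67, 60, 62],
    "German":   [56, 46, 67, 55, 50],
}

# One-way ANOVA across the four language groups
f_stat, p_anova = stats.f_oneway(*scores.values())

# Pairwise comparison (e.g., Japanese vs. English) with Bonferroni
# correction: multiply the raw p-value by the number of comparisons.
n_comparisons = 3  # Japanese vs. each translated language (assumed)
u_stat, p_raw = stats.mannwhitneyu(scores["Japanese"], scores["English"])
p_bonferroni = min(p_raw * n_comparisons, 1.0)

# Linear regression: translation-quality rating vs. count of correct
# responses across five attempts (both arrays hypothetical, per question)
quality = [3.0, 4.5, 2.5, 5.0, 4.0, 3.5]  # translation quality score
correct = [2, 4, 1, 5, 4, 3]              # correct responses out of 5
reg = stats.linregress(quality, correct)

print(f"ANOVA p={p_anova:.4f}, Bonferroni p={p_bonferroni:.4f}, "
      f"slope={reg.slope:.2f}")
```

The Bonferroni step simply scales each pairwise p-value by the number of comparisons performed, which matches the multiple-comparison control named in the methods; the exact set of pairwise contrasts in the study is not specified in the abstract, so `n_comparisons` here is an assumption.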
The median scores (interquartile range) for the 146 questions were 70 (68-72) (Japanese), 89 (84.5-95.5) (GPT-4 English), 64 (55.5-67) (Chinese), and 56 (46.5-67.5) (German). Significant differences were found between Japanese and English (p = 0.002) and between Japanese and German (p = 0.022). The counts of correct responses across five attempts for each question were significantly associated with the quality of translation into English (GPT-4, DeepL) and German (GPT-4). In a subset of 31 questions where English translations yielded fewer correct responses than Japanese originals, professionally translated questions yielded better scores than those translated by GPT-4 (13 versus 8 points, p = 0.0079).
GPT-4 exhibits higher accuracy when responding to English-translated questions than to the original Japanese questions, a trend not observed with German or Chinese translations. Accuracy improves with higher-quality English translations. These findings underscore the importance of high-quality translation in improving GPT-4's response accuracy on diagnostic radiology questions posed in non-English languages, and in helping non-native English speakers obtain accurate answers from large language models.