Chen Christine L, Dong Yue, Castillo-Zambrano Claudia, Bencheqroun Hassan, Barwise Amelia, Hoffman Adria, Nalaie Keivan, Qiu Yishu, Boulekbache Oualid, Niven Alexander S
Division of Internal Medicine, Mayo Clinic, 200 First St. SW, Rochester, MN, 55905, USA.
Division of Pulmonary and Critical Care Medicine, Mayo Clinic, 200 First St. SW, Rochester, MN, 55905, USA.
BMC Med Educ. 2025 Jul 8;25(1):1022. doi: 10.1186/s12909-025-07452-9.
Language barriers significantly limit access to critical care education worldwide. Machine translation (MT) offers considerable promise for increasing access to critical care content and has evolved rapidly with newer artificial intelligence frameworks and large language models. The best approach to systematically applying and evaluating these tools, however, remains unclear.
We developed a multimodal method to evaluate translations of critical care content used in an established international critical care education program. Four freely available MT tools were selected (DeepL™, Google Gemini™, Google Translate™, Microsoft CoPilot™) and used to translate selected phrases and paragraphs into Chinese (Mandarin), Spanish, and Ukrainian. A human translation by a professional medical translator served as the comparator. Translations were compared using 1) blinded bilingual clinician evaluations on anchored Likert domains of fluency, adequacy, and meaning; 2) automated BiLingual Evaluation Understudy (BLEU) scores; and 3) the validated System Usability Scale (SUS) to assess the ease of use of each MT tool. Blinded bilingual clinician evaluations were reported as individual domain scores and as averaged composite scores.
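For readers who wish to reproduce the automated portions of such a workflow, the minimal Python sketch below illustrates corpus-level BLEU scoring with the open-source sacrebleu package and standard SUS scoring. The example sentences, rater responses, and configuration are illustrative assumptions only, not the study's data or exact pipeline.

    # Illustrative sketch, not the study's exact pipeline.
    # BLEU via the open-source `sacrebleu` package (pip install sacrebleu),
    # plus standard System Usability Scale (SUS) scoring.
    import sacrebleu

    # --- Automated BLEU score ---------------------------------------------
    # Hypothetical MT output and a professional human reference translation.
    mt_outputs = ["El paciente requiere ventilación mecánica inmediata."]
    references = [["El paciente necesita ventilación mecánica de inmediato."]]

    # corpus_bleu takes a list of hypotheses and a list of reference lists;
    # scores are reported on a 0-100 scale. For Chinese output, pass
    # tokenize="zh" so scoring is character-based.
    bleu = sacrebleu.corpus_bleu(mt_outputs, references)
    print(f"BLEU: {bleu.score:.1f}")

    # --- System Usability Scale --------------------------------------------
    def sus_score(responses: list[int]) -> float:
        """Standard SUS scoring for ten 1-5 Likert items:
        odd-numbered items contribute (score - 1), even-numbered items
        contribute (5 - score); the sum is scaled by 2.5 to 0-100."""
        assert len(responses) == 10
        total = sum((r - 1) if i % 2 == 0 else (5 - r)
                    for i, r in enumerate(responses))
        return total * 2.5

    # Hypothetical single rater's responses for one MT tool.
    print(sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 1]))  # -> 85.0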
Blinded clinician composite scores were highest for human translation (Chinese), Google Gemini (Spanish), and Microsoft CoPilot (Ukrainian); Microsoft CoPilot (Chinese) and Google Translate (Spanish and Ukrainian) received the lowest scores. All Chinese and Spanish versions earned "understandable to good" or "high quality" BLEU scores, while Ukrainian versions overall scored in the "hard to get the gist" range, except with Microsoft CoPilot. Usability scores were highest for DeepL (Chinese), Google Gemini (Spanish), and Google Translate (Ukrainian), and lower for Microsoft CoPilot (Chinese and Ukrainian) and Google Translate (Spanish).
No single MT tool performed best across all metrics and languages, highlighting the importance of routinely assessing these tools during educational activities, given their rapid ongoing evolution. We offer a multimodal evaluation methodology to support this assessment as medical educators expand their use of MT in international educational programs.