Kong Marianna, Fernandez Alicia, Bains Jaskaran, Milisavljevic Ana, Brooks Katherine C, Shanmugam Akash, Avilez Leslie, Li Junhong, Honcharov Vladyslav, Yang Andersen, Khoong Elaine C
Department of Family and Community Medicine, University of California San Francisco, San Francisco, California, USA.
Division of General Internal Medicine at Zuckerberg San Francisco General Hospital, University of California San Francisco, San Francisco, California, USA.
BMJ Qual Saf. 2025 Jul 9. doi: 10.1136/bmjqs-2024-018384.
Machine translation of patient-specific information could mitigate language barriers if sufficiently accurate and non-harmful and may be particularly useful in healthcare encounters when professional translators are not readily available. We evaluated the translation accuracy and potential for harm of ChatGPT-4 and Google Translate in translating from English to Spanish, Chinese and Russian.
We used ChatGPT-4 and Google Translate to translate 50 sets (316 sentences) of deidentified, patient-specific, clinician free-text emergency department instructions into Spanish, Chinese and Russian. These were then back-translated into English by professional translators and double-coded by physicians for accuracy and potential for clinical harm.
At the sentence level, both tools were ≥90% accurate in translating English to Spanish (accuracy: GPT 97%; Google Translate 96%) and English to Chinese (accuracy: GPT 95%; Google Translate 90%); neither tool performed as well in translating English to Russian (accuracy: GPT 89%; Google Translate 80%). At the instruction set level, 16%, 24% and 56% of GPT-translated Spanish, Chinese and Russian instruction sets, respectively, contained at least one inaccuracy. For Google Translate, the corresponding figures were 24%, 56% and 66%. The potential for harm due to inaccurate translations was ≤1% for both tools in all languages at the sentence level and ≤6% at the instruction set level. GPT was significantly more accurate than Google Translate in Chinese and Russian at the sentence level; the potential for harm was similar between tools.
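The gap between sentence-level accuracy and the share of instruction sets containing at least one inaccuracy follows from simple compounding: a set of several sentences is flawed if any one sentence is. The sketch below is a back-of-the-envelope illustration only and is not part of the study's analysis. It assumes that sentence errors are independent and that every set has the mean length of 316/50 sentences; neither assumption comes from the source, and the function name is hypothetical.

```python
# Rough illustration (not the authors' method): how per-sentence accuracy
# compounds into the share of instruction sets with at least one inaccuracy.
# Assumptions (ours, not the study's): errors are independent across
# sentences, and every set has the mean number of sentences.

MEAN_SENTENCES_PER_SET = 316 / 50  # 316 sentences across 50 instruction sets


def share_of_sets_with_error(sentence_accuracy: float,
                             sentences_per_set: float = MEAN_SENTENCES_PER_SET) -> float:
    """Return the expected fraction of sets with at least one inaccurate sentence."""
    if not 0.0 <= sentence_accuracy <= 1.0:
        raise ValueError("sentence_accuracy must be between 0 and 1")
    # P(at least one error) = 1 - P(every sentence is accurate)
    return 1.0 - sentence_accuracy ** sentences_per_set


# Sentence-level accuracy and observed set-level inaccuracy rates,
# both taken from the reported results.
REPORTED = {
    ("GPT", "Spanish"): (0.97, 0.16),
    ("GPT", "Chinese"): (0.95, 0.24),
    ("GPT", "Russian"): (0.89, 0.56),
    ("Google Translate", "Spanish"): (0.96, 0.24),
    ("Google Translate", "Chinese"): (0.90, 0.56),
    ("Google Translate", "Russian"): (0.80, 0.66),
}

if __name__ == "__main__":
    for (tool, language), (accuracy, observed) in REPORTED.items():
        estimate = share_of_sets_with_error(accuracy)
        print(f"{tool:>16} {language:<8} "
              f"estimated {estimate:.0%} vs reported {observed:.0%}")
```

Any difference between an estimate and the reported figure would reflect the simplifying assumptions (for example, errors clustering in a few long or complex instruction sets) rather than a discrepancy in the results.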
These results support the potential of machine translation tools to mitigate gaps in translation services for low-stakes written communication from English to Spanish, while also strengthening the case for caution and professional oversight in higher-risk communication. Further research is needed to evaluate machine translation for other languages and for more technical content.