Ray Mondira, Kats Daniel J, Moorkens Joss, Rai Dinesh, Shaar Nate, Quinones Diane, Vermeulen Alejandro, Mateo Camila M, Brewster Ryan C L, Khan Alisa, Rader Benjamin, Brownstein John S, Hron Jonathan D
Division of General Pediatrics, Boston Children's Hospital, Boston, Massachusetts.
Department of Pediatrics, Harvard Medical School, Boston, Massachusetts.
JAMA Pediatr. 2025 Jul 7. doi: 10.1001/jamapediatrics.2025.1729.
IMPORTANCE: Patients and caregivers who use languages other than English in the US encounter barriers to accessing language-concordant written instructions after clinical visits. Large language models (LLMs), such as OpenAI's GPT-4o, may improve access to translated patient materials; however, rigorous evaluation is needed to ensure clinical standards are met.
OBJECTIVE: To determine whether GPT-4o can generate high-quality Spanish translations of personalized patient instructions comparable to those produced by professional human translators.
DESIGN, SETTING, AND PARTICIPANTS: This cross-sectional study compared LLM translations to professional human translations using equivalence testing. The personalized pediatric instructions used were derived from real clinical encounters at a large US academic medical center and translated between January 2023 and December 2023. Patient instructions in English were translated into Spanish by GPT-4o and professional human translators. The source English texts were translated using GPT-4o on August 2, 2024. Both sets of translations were evaluated by 3 independent professional medical translators.
EXPOSURES: Patient instructions were translated using GPT-4o with an engineered prompt, and these translations were compared with those produced by professional human translators.
MAIN OUTCOMES AND MEASURES: The primary outcome was translation quality, assessed using the Multidimensional Quality Metrics (MQM) framework to generate an overall MQM score (rated on a 0-100 scale). Secondary outcomes included a general preference rating and error rates for types of translation errors.
RESULTS: This study included 20 source files of pediatric patient instructions. Equivalence testing indicated that GPT-4o and human translations were equivalent in quality: the mean difference was 1.6 points (90% CI, 0.7-2.5), with the entire interval falling within the predefined equivalence margin of ±5 MQM points. The LLM yielded fewer mistranslation errors, and a mean (SE) of 52% (6%) of professional translator ratings preferred the LLM translations.
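The equivalence logic behind this result can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the study's analysis: the function names are hypothetical, and the second function assumes a normal sampling distribution to back out a standard error from the reported interval, whereas the authors' actual model is not described in the abstract. A 90% CI lying wholly inside the margin corresponds to two one-sided tests (TOST), each at the 0.05 level.

```python
from statistics import NormalDist


def equivalent_by_ci(ci_lower: float, ci_upper: float, margin: float) -> bool:
    """Return True if the whole CI lies strictly inside (-margin, +margin)."""
    return -margin < ci_lower and ci_upper < margin


def tost_p_value(mean_diff: float, ci_lower: float, ci_upper: float,
                 margin: float, ci_level: float = 0.90) -> float:
    """Approximate TOST p-value from a reported mean difference and CI.

    Assumption (not from the abstract): the estimate is normally
    distributed, so the standard error can be recovered from the CI width.
    """
    normal = NormalDist()
    z_crit = normal.inv_cdf(0.5 + ci_level / 2)       # about 1.645 for a 90% CI
    se = (ci_upper - ci_lower) / (2 * z_crit)         # standard error from CI width
    # One-sided test against H0: diff <= -margin
    p_lower = 1 - normal.cdf((mean_diff + margin) / se)
    # One-sided test against H0: diff >= +margin
    p_upper = normal.cdf((mean_diff - margin) / se)
    # Equivalence requires rejecting both nulls, so report the larger p-value
    return max(p_lower, p_upper)


# Values reported in the abstract: difference 1.6, 90% CI 0.7-2.5, margin 5
is_equivalent = equivalent_by_ci(0.7, 2.5, 5.0)
p_value = tost_p_value(1.6, 0.7, 2.5, 5.0)
```

With the reported values, the interval check succeeds because both bounds sit well inside the ±5 margin. Note that equivalence in this sense does not require the interval to include zero, only that it stays within the margin.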
CONCLUSIONS AND RELEVANCE: In this cross-sectional study, GPT-4o generated Spanish translations of pediatric patient instructions that were comparable in quality to those by professional human translators as evaluated using a standardized framework. While human review of LLM translation remains essential in health care, these findings suggest that GPT-4o could reduce the translation workload for Spanish, potentially freeing resources to support languages of lesser diffusion.