Faculty of Informatics, Computer Languages and Systems, Ixa Research Group, University of the Basque Country (UPV/EHU), Donostia, Spain.
J Am Med Inform Assoc. 2019 Dec 1;26(12):1478-1487. doi: 10.1093/jamia/ocz110.
To analyze techniques for machine translation of electronic health records (EHRs) between linguistically distant languages, using Basque and Spanish as a reference. We studied distinct configurations of neural machine translation systems and used different methods to overcome the lack of a bilingual corpus of clinical texts or health records in Basque and Spanish.
We trained recurrent neural networks on an out-of-domain corpus with different hyperparameter values. Subsequently, we used the optimal configuration to evaluate machine translation of EHR templates between Basque and Spanish, using manual translations of the Basque templates into Spanish as the reference standard. We successively added clinical resources to the training corpus: a Spanish-Basque dictionary derived from resources built for the machine translation of the Spanish edition of SNOMED CT into Basque, artificial sentences in Spanish and Basque derived from frequently occurring relationships in SNOMED CT, and Spanish monolingual EHRs. In addition to calculating bilingual evaluation understudy (BLEU) values, we tested performance in the clinical domain by human evaluation.
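For readers unfamiliar with the metric, the following is a minimal sketch of corpus-level BLEU, assuming whitespace tokenization, a single reference per sentence, n-grams up to length 4, and no smoothing. The function names `ngrams` and `corpus_bleu` and the example sentences are illustrative assumptions; the abstract does not state which BLEU implementation or tokenization the authors used, so scores from this sketch are not comparable with the reported values.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Assumptions of this sketch: whitespace tokenization, no smoothing.
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any n-gram order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    # Made-up example sentences, not taken from the study's data.
    system_output = ["the patient has no fever today"]
    manual_translation = ["the patient has no fever"]
    print(round(corpus_bleu(system_output, manual_translation), 2))
```

Because the score is a geometric mean over n-gram orders, a single order with no matches drives the unsmoothed score to zero, which is one reason BLEU is normally reported at corpus level rather than per sentence.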
We achieved slight improvements over our reference system by tuning some hyperparameters using an out-of-domain bilingual corpus, obtaining 10.67 BLEU points for Basque-to-Spanish clinical domain translation. The inclusion of clinical terminology in Spanish and Basque and the application of the back-translation technique on monolingual EHRs significantly improved the performance, obtaining 21.59 BLEU points. This was confirmed by the human evaluation performed by 2 clinicians, who ranked our machine translations close to the human translations.
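The back-translation step can be sketched as follows, assuming the usual formulation of the technique: monolingual sentences in the target language (here Spanish) are translated into the source language (Basque) by a reverse model, and the resulting synthetic pairs are added to the training data of the Basque-to-Spanish system. The names `back_translate`, `augment_corpus`, and `reverse_translate` are hypothetical, and the stand-in translator below is a placeholder rather than a real model; the abstract does not describe the authors' actual pipeline at this level of detail.

```python
from typing import Callable, Iterable, List, Tuple

SentencePair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(
    target_sentences: Iterable[str],
    reverse_translate: Callable[[str], str],
) -> List[SentencePair]:
    """Build synthetic (source, target) pairs from monolingual target text.

    `reverse_translate` is a hypothetical target-to-source translator
    (for this setting, a Spanish-to-Basque model).
    """
    pairs: List[SentencePair] = []
    for target in target_sentences:
        target = target.strip()
        if not target:
            continue  # skip empty lines in the monolingual data
        synthetic_source = reverse_translate(target)
        # The source side is machine-generated; the target side is genuine text.
        pairs.append((synthetic_source, target))
    return pairs


def augment_corpus(
    parallel: Iterable[SentencePair],
    synthetic: Iterable[SentencePair],
) -> List[SentencePair]:
    """Concatenate genuine parallel data with synthetic back-translated pairs."""
    return list(parallel) + list(synthetic)


if __name__ == "__main__":
    # Placeholder standing in for a trained Spanish-to-Basque model.
    def placeholder_reverse_model(sentence: str) -> str:
        return "<synthetic-eu> " + sentence

    monolingual_spanish = ["Paciente estable.", "", "Sin fiebre."]
    synthetic_pairs = back_translate(monolingual_spanish, placeholder_reverse_model)
    training_data = augment_corpus([], synthetic_pairs)
    for source, target in training_data:
        print(source, "|||", target)
```

The design point is that only the source side is synthetic: the system still learns to produce genuine in-domain Spanish, which is why the technique can help when no bilingual clinical corpus exists.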
We showed that, even after the hyperparameters had been optimized out of domain, including the available clinical-domain resources and applying the described methods was beneficial for the stated objective, yielding adequate translations of EHR templates.
We have developed a system that properly translates health record templates from Basque to Spanish without using any bilingual corpus of clinical texts or health records.