Department of Medical Informatics, Erasmus University Medical Center, 3015 GD Rotterdam, The Netherlands.
J Am Med Inform Assoc. 2024 Aug 1;31(8):1725-1734. doi: 10.1093/jamia/ocae159.
To explore the feasibility of validating Dutch concept extraction tools using annotated corpora translated from English, focusing on preserving annotations during translation and addressing the scarcity of non-English annotated clinical corpora.
Three annotated corpora were standardized and translated from English to Dutch using 2 machine translation services, Google Translate and OpenAI GPT-4, with annotations preserved through a proposed method of embedding annotations in the text before translation. The performance of 2 concept extraction tools, MedSpaCy and MedCAT, was assessed across the corpora in both Dutch and English.
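The idea of embedding annotations in the text before translation can be illustrated with a minimal sketch. The abstract does not specify the markup the authors used, so the numbered inline tags (`<a0>…</a0>`), the function names `embed_annotations` and `extract_annotations`, and the example sentence are all assumptions made for illustration. The translation step itself is simulated with a hand-written Dutch string rather than a call to any translation service.

```python
import re

# Hypothetical tag format: <aN>...</aN>, where N indexes the annotation.
TAG = re.compile(r"<a(\d+)>(.*?)</a\1>", re.DOTALL)


def embed_annotations(text, annotations):
    """Wrap each annotated span in a numbered inline tag.

    annotations: non-overlapping (start, end, label) character offsets.
    Returns the tagged text and the labels in tag-index order.
    """
    pieces, labels, cursor = [], [], 0
    for idx, (start, end, label) in enumerate(sorted(annotations)):
        pieces.append(text[cursor:start])
        pieces.append(f"<a{idx}>{text[start:end]}</a{idx}>")
        labels.append(label)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), labels


def extract_annotations(tagged, labels):
    """Strip the tags from (translated) text and recompute span offsets."""
    pieces, spans, cursor, length = [], [], 0, 0
    for match in TAG.finditer(tagged):
        before = tagged[cursor:match.start()]
        span_text = match.group(2)
        pieces.extend([before, span_text])
        start = length + len(before)
        spans.append((start, start + len(span_text), labels[int(match.group(1))]))
        length = start + len(span_text)
        cursor = match.end()
    pieces.append(tagged[cursor:])
    return "".join(pieces), spans


english = "Patient has diabetes and hypertension."
gold = [(12, 20, "PROBLEM"), (25, 37, "PROBLEM")]
tagged, labels = embed_annotations(english, gold)

# Stand-in for the machine translation output (written by hand here).
translated = "Patiënt heeft <a0>diabetes</a0> en <a1>hypertensie</a1>."
dutch, dutch_spans = extract_annotations(translated, labels)
```

In this sketch a tag that the translator drops or mangles would simply fail to match, so the corresponding annotation would be lost; a real pipeline would need to detect and handle such cases.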
The translation process effectively generated Dutch annotated corpora and the concept extraction tools performed similarly in both English and Dutch. Although there were some differences in how annotations were preserved across translations, these did not affect extraction accuracy. Supervised MedCAT models consistently outperformed unsupervised models, whereas MedSpaCy demonstrated high recall but lower precision.
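For readers less familiar with the metrics mentioned above, the following sketch shows how precision, recall, and F1 could be computed for extracted concept spans. Exact matching on `(start, end, label)` is an assumption; the abstract does not state which matching criterion the authors applied, and the spans below are invented for illustration only.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted spans against gold spans using exact matching."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example: every gold span is found, plus two spurious ones.
gold_spans = {(14, 22, "PROBLEM"), (26, 37, "PROBLEM")}
predicted_spans = gold_spans | {(0, 7, "PROBLEM"), (8, 13, "PROBLEM")}

precision, recall, f1 = precision_recall_f1(gold_spans, predicted_spans)
```

Here the hypothetical tool recovers all gold spans (recall 1.0) but half of its predictions are spurious (precision 0.5), which is the kind of high-recall, lower-precision profile described for MedSpaCy.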
Our validation of Dutch concept extraction tools on corpora translated from English was successful, highlighting the efficacy of our annotation preservation method and the potential for efficiently creating multilingual corpora. Further improvements and comparisons of annotation preservation techniques and strategies for corpus synthesis could lead to more efficient development of multilingual corpora and accurate non-English concept extraction tools.
This study has demonstrated that translated English corpora can be used to validate non-English concept extraction tools. The annotation preservation method used during translation proved effective, and future research can apply this corpus translation method to additional languages and clinical settings.