Norwegian Centre for E-health Research, Tromsø, Norway.
Department of Informatics, Bioengineering, Robotics and Systems Engineering (DIBRIS), University of Genoa, Genoa, Italy.
AMIA Annu Symp Proc. 2024 Jan 11;2023:456-464. eCollection 2023.
The lack of relevant annotated datasets is a key limitation in applying Natural Language Processing techniques to a broad range of tasks, among them the identification of Protected Health Information (PHI) in Norwegian clinical text. This work explores the possibility of exploiting resources from Swedish, a language very closely related to Norwegian. The Swedish dataset is annotated with PHI. Different processing and text augmentation techniques are evaluated, along with their impact on the final performance of the model. The augmentation techniques, such as injecting and generating both Norwegian and Scandinavian named entities into the Swedish training corpus, were shown to improve performance on the de-identification task for both Danish and Norwegian text. This trend was also confirmed by evaluating model performance on a sample of Norwegian gastro-surgical clinical text.
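The entity-injection augmentation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the BIO tagging scheme, the label names `PER` and `LOC`, the function `inject_entities`, and the tiny gazetteer below are all assumptions made for illustration. The idea shown is simply to replace each annotated entity span in a Swedish training sentence with a name sampled from a Norwegian list, keeping the surrounding text and the labels intact.

```python
import random
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (surface form, BIO tag); tagging scheme is an assumption


def inject_entities(
    sentence: List[Token],
    gazetteer: Dict[str, List[str]],
    rng: random.Random,
) -> List[Token]:
    """Replace each annotated entity span with a sampled gazetteer entry.

    Hypothetical helper: spans whose label has no gazetteer entries are
    copied through unchanged.
    """
    out: List[Token] = []
    i = 0
    while i < len(sentence):
        word, tag = sentence[i]
        if tag.startswith("B-"):
            label = tag[2:]
            # Find the end of the span (consecutive I- tags with the same label).
            j = i + 1
            while j < len(sentence) and sentence[j][1] == f"I-{label}":
                j += 1
            candidates = gazetteer.get(label)
            if candidates:
                replacement = rng.choice(candidates).split()
                out.append((replacement[0], f"B-{label}"))
                out.extend((w, f"I-{label}") for w in replacement[1:])
            else:
                out.extend(sentence[i:j])
            i = j
        else:
            out.append((word, tag))
            i += 1
    return out


# Illustrative Swedish sentence with made-up annotations.
sentence = [
    ("Patienten", "O"),
    ("Anna", "B-PER"),
    ("Svensson", "I-PER"),
    ("bor", "O"),
    ("i", "O"),
    ("Göteborg", "B-LOC"),
]

# Illustrative Norwegian gazetteer (placeholder names, not from the paper).
gazetteer = {
    "PER": ["Ola Nordmann", "Kari Nordmann"],
    "LOC": ["Tromsø", "Bergen"],
}

augmented = inject_entities(sentence, gazetteer, random.Random(0))
print(augmented)
```

Passing a seeded `random.Random` keeps the augmentation reproducible. Running the function several times with different seeds over the same corpus would yield multiple augmented variants of each sentence.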