Arzideh Kamyar, Baldini Giulia, Winnekens Philipp, Friedrich Christoph M, Nensa Felix, Idrissi-Yaghir Ahmad, Hosch René
Central IT Department, Data Integration Center, University Hospital Essen, Essen, Germany.
Institute for Artificial Intelligence in Medicine, University Hospital Essen, Essen, Germany.
Appl Clin Inform. 2025 Jan;16(1):31-43. doi: 10.1055/a-2424-1989. Epub 2025 Jan 8.
Commercially available large language models such as Chat Generative Pre-Trained Transformer (ChatGPT) cannot be applied to real patient data for data protection reasons. At the same time, de-identification of clinical unstructured data is a tedious and time-consuming task when done manually. Since transformer models can efficiently process and analyze large amounts of text data, our study aims to explore the impact of a large training dataset on the performance of this task.
We utilized a substantial dataset of 10,240 German hospital documents from 1,130 patients, created as part of the investigating hospital's routine documentation, as training data. Our approach involved simultaneously fine-tuning an ensemble of two transformer-based language models to identify sensitive data within the documents. Annotation guidelines with specific annotation categories and types were created for annotator training.
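As an illustrative sketch only (not the authors' code), an ensemble of two token-classification models can be combined so that a token flagged as sensitive by either model is de-identified. The label names (`B-NAME`, `B-DATE`, etc.) and the merge rule are hypothetical examples of such a scheme:

```python
def merge_predictions(tokens, labels_a, labels_b):
    """Merge two models' per-token labels, preferring any sensitive
    label over 'O' (outside) so a token flagged by either model is
    treated as sensitive. Illustrative only."""
    merged = []
    for tok, a, b in zip(tokens, labels_a, labels_b):
        # Keep model A's label unless it predicted 'O' and model B
        # flagged the token as sensitive.
        label = a if a != "O" else b
        merged.append((tok, label))
    return merged

# Hypothetical German progress-note snippet with two models' outputs.
tokens  = ["Patient", "Max", "Mustermann", ",", "geb.", "01.01.1980"]
model_a = ["O", "B-NAME", "I-NAME", "O", "O", "O"]
model_b = ["O", "B-NAME", "I-NAME", "O", "O", "B-DATE"]

print(merge_predictions(tokens, model_a, model_b))
```

A union-style merge like this trades precision for recall, which is often the preferred direction in de-identification, where missing a piece of sensitive data is costlier than over-redacting.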
Performance evaluation on a test dataset of 100 manually annotated documents revealed that our fine-tuned German ELECTRA (gELECTRA) model achieved an F1 macro average score of 0.95, surpassing human annotators who scored 0.93.
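The reported score is a macro-averaged F1, i.e. the unweighted mean of per-class F1 scores, so rare sensitive-data categories count as much as frequent ones. A minimal pure-Python sketch of the metric, with hypothetical labels:

```python
def f1_macro(gold, pred):
    """Macro-averaged F1 over all label classes seen in either
    the gold or predicted sequence."""
    labels = set(gold) | set(pred)
    f1_scores = []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1_scores.append(f1)
    # Unweighted mean: every class contributes equally.
    return sum(f1_scores) / len(f1_scores)

gold = ["NAME", "O", "DATE", "O", "NAME"]
pred = ["NAME", "O", "O",    "O", "NAME"]
print(round(f1_macro(gold, pred), 3))  # → 0.6
```

Here the missed `DATE` token drags the macro average down sharply even though most tokens are correct, which is exactly why macro averaging is a strict metric for imbalanced de-identification categories.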
We trained and evaluated transformer models to detect sensitive information in German real-world pathology reports and progress notes. After defining an annotation scheme tailored to the investigating hospital's documents and creating annotation guidelines for staff training, we conducted a further experimental study comparing the models with human annotators. The best-performing model achieved better overall results than two experienced annotators who manually labeled 100 clinical documents.