Institute for Medical Informatics, Statistics, and Epidemiology, Leipzig University, Germany.
GeMTeX Consortium of the German Medical Informatics Initiative.
Stud Health Technol Inform. 2024 Aug 30;317:171-179. doi: 10.3233/SHTI240853.
The German Medical Text Project (GeMTeX) is one of the largest infrastructure efforts targeting German-language clinical documents. Here we introduce the architecture of the GeMTeX de-identification pipeline.
This pipeline comprises the export of raw clinical documents from the local hospital information system, the import into the annotation platform INCEpTION, fully automatic pre-tagging with protected health information (PHI) items by the Averbis Health Discovery pipeline, a manual curation step of these pre-annotated data, and, finally, the automatic replacement of PHI items with type-conformant substitutes. This design was implemented in a pilot study involving six annotators and two curators each at the Data Integration Centers of the University Hospitals Leipzig and Erlangen.
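The final step, replacing PHI items with type-conformant substitutes, can be illustrated with a minimal Python sketch. It is not the GeMTeX implementation. The `PhiSpan` structure, the type labels (`NAME`, `DATE`, `LOCATION`), the substitute pools, and the `replace_phi` function are all hypothetical names introduced for illustration. The sketch assumes curated, non-overlapping character-offset annotations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhiSpan:
    """A curated PHI annotation (hypothetical structure for illustration)."""
    start: int      # inclusive character offset
    end: int        # exclusive character offset
    phi_type: str   # hypothetical label such as "NAME" or "DATE"


# Hypothetical substitute pools, one per PHI type.
SUBSTITUTES = {
    "NAME": ["Erika Mustermann", "Max Mustermann"],
    "DATE": ["01.01.2000"],
    "LOCATION": ["Musterstadt"],
}


def replace_phi(text: str, spans: list[PhiSpan]) -> str:
    """Replace each PHI span with a substitute of the same type.

    Assumes spans do not overlap. Identical surface forms of the same
    type receive the same substitute within one document.
    """
    ordered = sorted(spans, key=lambda s: s.start)
    mapping: dict[tuple[str, str], str] = {}
    used: dict[str, int] = {}

    # First pass (left to right): choose a substitute per distinct PHI item.
    for span in ordered:
        key = (span.phi_type, text[span.start:span.end])
        if key not in mapping:
            # Unknown types fall back to a bracketed type placeholder.
            pool = SUBSTITUTES.get(span.phi_type, [f"[{span.phi_type}]"])
            idx = used.get(span.phi_type, 0)
            mapping[key] = pool[idx % len(pool)]
            used[span.phi_type] = idx + 1

    # Second pass (right to left): splice in substitutes so that the
    # offsets of spans further left remain valid.
    result = text
    for span in reversed(ordered):
        key = (span.phi_type, text[span.start:span.end])
        result = result[:span.start] + mapping[key] + result[span.end:]
    return result
```

Applying the replacements from right to left avoids recomputing offsets after each substitution. A production system would additionally need to handle overlapping annotations and generate substitutes that preserve properties such as gender or date intervals.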
As a proof of concept, the publicly available Graz Synthetic Text Clinical Corpus (GRASSCO) was enhanced with PHI annotations in an annotation campaign, for which we report an inter-annotator agreement of Krippendorff's α ≈ 0.97.
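For readers unfamiliar with the agreement measure, the following Python sketch computes Krippendorff's α for nominal labels using the coincidence-matrix formulation. The abstract does not state which α variant or which unit of analysis (tokens or spans) was used, so the nominal metric, the input layout, and the function name `krippendorff_alpha_nominal` are assumptions made for illustration only.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list with one inner list per annotated unit; each inner
    list holds the labels assigned by the annotators, with None marking
    a missing annotation. Units with fewer than two labels are ignored.
    """
    # Coincidence matrix: every ordered pair of labels within a unit
    # contributes 1 / (m - 1), where m is the number of labels in the unit.
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    # Marginal totals per label and the overall number of pairable values.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Observed and expected disagreement (off-diagonal mass only).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    )
    if expected == 0:
        raise ValueError("alpha is undefined when fewer than two labels occur")

    return 1 - (n - 1) * observed / expected
```

With two annotators who disagree on one of four units, for example `[["NAME", "NAME"], ["NAME", "DATE"], ["DATE", "DATE"], ["DATE", "DATE"]]`, the function yields α = 8/15 ≈ 0.53; perfect agreement across at least two label types yields α = 1.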
These 1.4K curated PHI annotations are released as open-source data, constituting the first publicly available German-language clinical text corpus with PHI metadata.