De-Identifying GRASCCO - A Pilot Study for the De-Identification of the German Medical Text Project (GeMTeX) Corpus.

Affiliations

Institute for Medical Informatics, Statistics, and Epidemiology, Leipzig University, Germany.

GeMTeX Consortium of the German Medical Informatics Initiative.

Publication information

Stud Health Technol Inform. 2024 Aug 30;317:171-179. doi: 10.3233/SHTI240853.

PMID:39234720
Abstract

INTRODUCTION

The German Medical Text Project (GeMTeX) is one of the largest infrastructure efforts targeting German-language clinical documents. We here introduce the architecture of the de-identification pipeline of GeMTeX.

METHODS

This pipeline comprises the export of raw clinical documents from the local hospital information system, the import into the annotation platform INCEpTION, fully automatic pre-tagging with protected health information (PHI) items by the Averbis Health Discovery pipeline, a manual curation step of these pre-annotated data, and, finally, the automatic replacement of PHI items with type-conformant substitutes. This design was implemented in a pilot study involving six annotators and two curators each at the Data Integration Centers of the University Hospitals Leipzig and Erlangen.
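The final step of the pipeline, replacing curated PHI items with type-conformant substitutes, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `PhiSpan` structure, the type labels, and the fixed `SURROGATES` table are hypothetical and are not taken from GeMTeX, INCEpTION, or Averbis Health Discovery.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhiSpan:
    start: int      # inclusive character offset
    end: int        # exclusive character offset
    phi_type: str   # hypothetical type label, e.g. "NAME" or "DATE"


# Hypothetical surrogate table. A real system would presumably draw varied,
# realistic substitutes instead of one fixed value per type.
SURROGATES = {
    "NAME": "Max Mustermann",
    "DATE": "01.01.2000",
    "LOCATION": "Musterstadt",
}


def span_of(text: str, needle: str, phi_type: str) -> PhiSpan:
    """Build a span for the first occurrence of needle (stands in for curation)."""
    start = text.index(needle)
    return PhiSpan(start, start + len(needle), phi_type)


def replace_phi(text: str, spans: list[PhiSpan]) -> str:
    """Replace each PHI span with a substitute of the same type."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        substitute = SURROGATES.get(span.phi_type, f"[{span.phi_type}]")
        out = out[:span.start] + substitute + out[span.end:]
    return out


note = "Herr Peter Schmidt wurde am 12.03.2021 in Leipzig aufgenommen."
spans = [
    span_of(note, "Peter Schmidt", "NAME"),
    span_of(note, "12.03.2021", "DATE"),
    span_of(note, "Leipzig", "LOCATION"),
]
deidentified = replace_phi(note, spans)
print(deidentified)
```

Replacing from the last span to the first avoids recomputing offsets after each substitution, and unknown types fall back to a bracketed placeholder instead of leaving the original text in place.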

RESULTS

As a proof of concept, the publicly available Graz Synthetic Text Clinical Corpus (GRASCCO) was enhanced with PHI annotations in an annotation campaign for which reasonable inter-annotator agreement values of Krippendorff's α ≈ 0.97 can be reported.
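For readers unfamiliar with the agreement measure, Krippendorff's α for nominal labels can be computed as in the sketch below. The abstract does not say how units were defined or which variant of α was used, so this only shows the standard nominal form, α = 1 − D_o / D_e, built from a coincidence matrix. The function name and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: one list of labels per annotated unit, one entry per annotator;
    None marks a missing annotation.
    """
    coincidences = Counter()
    for labels in units:
        labels = [label for label in labels if label is not None]
        m = len(labels)
        if m < 2:
            continue  # a unit with fewer than two labels cannot be paired
        # Each ordered pair of labels from different annotators adds 1 / (m - 1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Both terms are scaled by n, which cancels in the ratio D_o / D_e.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined without at least two observed labels")
    return 1 - observed / expected


# Toy example: two annotators, three units, one disagreement.
toy_units = [["NAME", "NAME"], ["NAME", "DATE"], ["DATE", "DATE"]]
alpha = krippendorff_alpha_nominal(toy_units)
print(round(alpha, 3))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, which puts the reported α ≈ 0.97 close to the top of the scale.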

CONCLUSION

These curated 1.4 K PHI annotations are released as open-source data constituting the first publicly available German clinical language text corpus with PHI metadata.
