

Semi-automated De-identification of German Content Sensitive Reports for Big Data Analytics.

Author Information

Seuss Hannes, Dankerl Peter, Ihle Matthias, Grandjean Andrea, Hammon Rebecca, Kaestle Nicola, Fasching Peter A, Maier Christian, Christoph Jan, Sedlmayr Martin, Uder Michael, Cavallaro Alexander, Hammon Matthias

Affiliations

Department of Radiology, University Hospital Erlangen, Friedrich Alexander Universität (FAU) Erlangen-Nürnberg, Erlangen, Germany.

Text Analytics, Averbis GmbH, Freiburg, Germany.

Publication Information

Rofo. 2017 Jul;189(7):661-671. doi: 10.1055/s-0043-102939. Epub 2017 Mar 23.

Abstract

Projects involving collaborations between different institutions require data security via selective de-identification of words or phrases. A semi-automated de-identification tool was developed and evaluated on different types of medical reports, both natively and after adapting the algorithm to the text structure.

The tool was evaluated for its sensitivity and specificity in detecting sensitive content in written reports. Data from 4671 pathology reports (4105 + 566 in two different formats), 2804 medical reports, 1008 operation reports, and 6223 radiology reports of 1167 patients suffering from breast cancer were de-identified. The content was itemized into four categories: direct identifiers (name, address), indirect identifiers (date of birth/operation, medical ID, etc.), medical terms, and filler words. The software was first tested natively (without training) to establish a baseline. The reports were then manually edited and the model re-trained for the next test set; re-training was applied after manually editing 25, 50, 100, 250, 500 and, if applicable, 1000 reports of each type.

In the native test, 61.3 % of direct and 80.8 % of indirect identifiers were detected. Detection performance (P) increased to 91.4 % (P25), 96.7 % (P50), 99.5 % (P100), 99.6 % (P250), 99.7 % (P500) and 100 % (P1000) for direct identifiers, and to 93.2 % (P25), 97.9 % (P50), 97.2 % (P100), 98.9 % (P250), 99.0 % (P500) and 99.3 % (P1000) for indirect identifiers. Without training, 5.3 % of medical terms were falsely flagged as critical data; after training, this rate fell to 4.0 % (P25), 3.6 % (P50), 4.0 % (P100), 3.7 % (P250), 4.3 % (P500) and 3.1 % (P1000). Roughly 0.1 % of filler words were falsely flagged.

Training of the de-identification tool continuously improved its performance. Training with roughly 100 edited reports enables reliable detection and labeling of sensitive data in different types of medical reports.
· Collaborations between different institutions require de-identification of patients' data.
· Software-based de-identification of content-sensitive reports grows in importance as a result of 'Big data'.
· A de-identification software was developed and tested natively and after training.
· The proposed de-identification software worked quite reliably following training with roughly 100 edited reports.
· A final check of the texts by an authorized person remains necessary.
· Seuss H, Dankerl P, Ihle M et al. Semi-automated De-identification of German Content Sensitive Reports for Big Data Analytics. Fortschr Röntgenstr 2017; 189: 661-671.
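The abstract describes flagging direct identifiers (names, addresses) and indirect identifiers (dates, medical IDs) in report text. The published tool is a trainable system from Averbis; the following is only a minimal, hypothetical sketch of the underlying idea using regular expressions and a dictionary of known names. All pattern names, formats, and example strings are illustrative assumptions, not the authors' implementation.

```python
import re

# Illustrative patterns for indirect identifiers; the German date
# format and the ID format are assumptions for this sketch.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b"),   # e.g. 01.02.1956
    "MEDICAL_ID": re.compile(r"\bID[- ]?\d{6,}\b"),       # e.g. ID 1234567
}

# Direct identifiers would come from a site-specific dictionary
# (patient registry); these names are placeholders.
DIRECT_IDENTIFIERS = {"Mustermann", "Erika"}

def deidentify(text: str) -> str:
    """Replace flagged tokens with category placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    for name in DIRECT_IDENTIFIERS:
        text = re.sub(rf"\b{re.escape(name)}\b", "[NAME]", text)
    return text

report = "Patientin Erika Mustermann, geb. 01.02.1956, ID 1234567."
print(deidentify(report))
# → Patientin [NAME] [NAME], geb. [DATE], [MEDICAL_ID].
```

A purely rule-based approach like this explains the baseline error profile reported in the study: medical terms that resemble names or IDs get falsely flagged, which is why the authors retrain the model on manually edited reports and still require a final check by an authorized person.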

